*From: *Robert Bradshaw <rober...@google.com>
*Date: *Thu, May 9, 2019 at 7:48 AM
*To: *dev

From: Kenneth Knowles <k...@apache.org>
> Date: Thu, May 9, 2019 at 10:05 AM
> To: dev
>
> > This is a huge development. Top posting because I can be more compact.
> >
> > I really think after the initial idea converges this needs a design doc
> with goals and alternatives. It is an extraordinarily consequential model
> change. So in the spirit of doing the work / bias towards action, I created
> a quick draft at https://s.apache.org/beam-schemas and added everyone on
> this thread as editors. I am still in the process of writing this to match
> the thread.
>
> Thanks! Added some comments there.
>
> > *Multiple timestamp resolutions*: you can use logical types to represent
> nanos the same way Java and proto do.
>
> As per the other discussion, I'm unsure whether the value in supporting
> multiple timestamp resolutions is high enough to outweigh the cost.
>

Yeah, still under discussion exactly what to do here. This is an urgent
problem for SQL - our existing virtual tables have two choices (sometimes
configurable, sometimes not): crash when you see a high-precision
timestamp, or lose data. At least making SQL timestamps (and the rest of
the 12 date/time types in SQL...) a logical type "ROW { seconds, nanos }"
seems necessary in the short term, so any underlying "TIMESTAMP" type is
relevant primarily because "GROUP BY TUMBLE(...)" and similar might have to
jump through hoops to use it. Calcite's codegen also has a hardcoded
assumption of millis, unfortunately.
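A minimal sketch of the "ROW { seconds, nanos }" idea, assuming a
nanosecond-precision epoch timestamp; all names here are illustrative, not
Beam's actual API:

```python
from typing import NamedTuple

class TimestampRow(NamedTuple):
    """Hypothetical row form of a high-precision SQL timestamp."""
    seconds: int  # whole seconds since the Unix epoch
    nanos: int    # sub-second component, 0 <= nanos < 1_000_000_000

def to_row(epoch_nanos: int) -> TimestampRow:
    """Split a nanosecond-precision timestamp into the row form."""
    return TimestampRow(epoch_nanos // 1_000_000_000,
                        epoch_nanos % 1_000_000_000)

def to_nanos(row: TimestampRow) -> int:
    """Rebuild the nanosecond timestamp from its row representation."""
    return row.seconds * 1_000_000_000 + row.nanos

def to_millis(row: TimestampRow) -> int:
    """Lossy projection down to the millis that Calcite's codegen assumes."""
    return row.seconds * 1_000 + row.nanos // 1_000_000
```

The round trip through `to_row`/`to_nanos` is lossless; only the projection
to millis drops precision, which is exactly the "lose data" path described
above.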

> *Why multiple int types?* The domains of values for these types are
> different. For a language with one "int" or "number" type, that's another
> domain of values.
>
> What is the value in having different domains? If your data has a
> natural domain, chances are it doesn't line up exactly with one of
> these. I guess it's for languages whose types have specific domains?
> (There's also compactness in representation, encoded and in-memory,
> though I'm not sure that's high.)
>

Are you asking why we have int16, int32, int64 as opposed to a single
domain of "integers"? Most languages have some of these types, so it is a
pretty natural fit. They can also have a fixed-width encoding; I'm no
expert on whether that becomes important for columnar batches.
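To illustrate the fixed-width point, here is a sketch using Python's
`struct` module; the format strings are illustrative stand-ins for a real
columnar layout:

```python
import struct
from typing import List

# Each int width maps to a fixed-size little-endian layout, so a column of
# values packs densely with no per-element length prefixes.
WIDTHS = {"int16": "<h", "int32": "<i", "int64": "<q"}

def encode_column(type_name: str, values: List[int]) -> bytes:
    """Pack a column of integers using the type's fixed-width format."""
    fmt = WIDTHS[type_name]
    return b"".join(struct.pack(fmt, v) for v in values)

# Three int32 values always occupy exactly 12 bytes.
assert len(encode_column("int32", [1, -2, 300])) == 12
```

With a single unbounded "integer" domain you would instead need a
variable-length encoding, and random access into a column would require an
offset index.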

> *Columnar/Arrow*: making sure we unlock the ability to take this path is
> paramount. So tying it directly to a row-oriented coder seems
> counterproductive.
>
> I don't think Coders are necessarily row-oriented. They are, however,
> bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
> overlap between what Coders express in terms of element typing
> information and what Schemas express, and I'd rather have one concept
> if possible. Or have a clear division of responsibilities.
>

A coder is more-or-less a function from element -> bytes. Do you have a
different idea? Like using coders just as a type declaration, with the
SDK/runner interacting through a second interface?
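The "function from element -> bytes" framing can be sketched as a minimal
interface; these class names are illustrative, not Beam's actual coder
classes:

```python
from abc import ABC, abstractmethod

class Coder(ABC):
    """Sketch of a coder: an invertible element <-> bytes function."""

    @abstractmethod
    def encode(self, element) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes): ...

class Utf8Coder(Coder):
    """Example coder for strings, round-tripping through UTF-8."""

    def encode(self, element: str) -> bytes:
        return element.encode("utf-8")

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8")
```

The open question in the thread is whether this bytes-oriented contract
should stay the sole source of type information, or whether a schema could
serve as the type declaration with encoding handled separately.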


> > *Multimap*: what does it add over an array-valued map or
> large-iterable-valued map? (honest question, not rhetorical)
>
> Multimap has a different notion of what it means to contain a value,
> can handle (unordered) unions of non-disjoint keys, etc. Maybe this
> isn't worth a new primitive type.


I guess it might come down to whether MultiMap<k, v> ::= Map<k,
Iterable<v>> as a logical type is efficient or merits a different encoding.
No strong opinion.
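The equivalence MultiMap<k, v> ::= Map<k, Iterable<v>> can be sketched
directly; the function names here are illustrative:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def multimap_from_pairs(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Build a multimap in its Map<k, Iterable<v>> representation."""
    out: Dict[str, List[int]] = defaultdict(list)
    for k, v in pairs:
        out[k].append(v)
    return dict(out)

def multimap_union(a: Dict[str, List[int]],
                   b: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Union of two multimaps: non-disjoint keys merge their value lists."""
    return multimap_from_pairs(
        [(k, v) for m in (a, b) for k, vs in m.items() for v in vs])
```

Whether this logical-type encoding is efficient enough, versus a dedicated
wire format, is the open question above.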


> > *URN/enum for type names*: I see the case for both. The core types are
> fundamental enough that they should never really change - after all, proto,
> thrift, avro, and arrow have addressed this (not to mention most programming
> languages). Maybe additions once every few years. I prefer the smallest
> intersection of these schema languages. A oneof is more clear, while a URN
> emphasizes the similarity of built-in and logical types.
>
> Hmm... Do we have any examples of the multi-level primitive/logical
> type in any of these other systems?


Yes, I'd say it is the rule not the exception:
https://github.com/protocolbuffers/protobuf/blob/d9ccd0c0e6bbda9bf4476088eeb46b02d7dcd327/java/compatibility_tests/v2.5.0/more_protos/src/proto/google/protobuf/descriptor.proto#L104


> I have a bias towards all types
> being on the same footing unless there is compelling reason to divide
> things into primitive/user-defined ones.
>

To be clear, my understanding here is that this is an AST representation
question, not an expressivity or user-facing API question. I don't think
URNs vs. oneof affects the universe of schemas, how their values are
embedded in specific languages, or how they are encoded. Today the
difference is front-and-center in Java, but that is not fundamental, and we
could come up with an in-Java representation that made all types look
equivalent to users. Now, the choice of what goes in the oneof and which
URNs to standardize is a different question, and one of the biggest
decisions. I just meant to comment on the minor issue.

Kenn


> Here it seems like the most essential value of the primitive type set
> is to describe the underlying representation, for encoding elements in
> a variety of ways (notably columnar, but also interfacing with other
> external systems like IOs). Perhaps, rather than the previous
> suggestion of making everything a logical of bytes, this could be made
> clear by still making everything a logical type, but renaming
> "TypeName" to Representation. There would be URNs (typically with
> empty payloads) for the various primitive types (whose mapping to
> their representations would be the identity).
>
> - Robert
>
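A sketch of the "everything is a logical type over a Representation" idea
in the last message; all URNs and names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalType:
    """Hypothetical type descriptor: a URN mapped onto a representation."""
    urn: str
    representation: str   # e.g. "int64", "bytes", "row"
    payload: bytes = b""  # typically empty for primitive types

# Primitive types map to their representation via the identity.
INT64 = LogicalType("illustrative:type:int64", "int64")

# A higher-level type reuses a representation rather than adding a primitive.
SQL_TIMESTAMP = LogicalType("illustrative:sql:timestamp", "row")
```

Under this framing the representation set stays small and stable (serving
columnar layouts and IO interop), while the URN space carries all the
type-level meaning.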
