Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Thu, 09 May 2019 07:49:03 -0700

From: Kenneth Knowles <k...@apache.org>
Date: Thu, May 9, 2019 at 10:05 AM
To: dev


> This is a huge development. Top posting because I can be more compact.
>
> I really think after the initial idea converges this needs a design doc with 
> goals and alternatives. It is an extraordinarily consequential model change. 
> So in the spirit of doing the work / bias towards action, I created a quick 
> draft at https://s.apache.org/beam-schemas and added everyone on this thread 
> as editors. I am still in the process of writing this to match the thread.

Thanks! Added some comments there.

> *Multiple timestamp resolutions*: you can use logical types to represent 
> nanos the same way Java and proto do.

As per the other discussion, I'm unsure the value in supporting
multiple timestamp resolutions is high enough to outweigh the cost.

> *Why multiple int types?* The domains of values for these types are different. 
> For a language with one "int" or "number" type, that's another domain of 
> values.

What is the value in having different domains? If your data has a
natural domain, chances are it doesn't line up exactly with one of
these. I guess it's for languages whose types have specific domains?
(There's also compactness of representation, both encoded and in-memory,
though I'm not sure that benefit is large.)
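
To illustrate the mismatch between natural domains and fixed-width types, here is a minimal sketch in Python. The table and the helper names (INT_DOMAINS, smallest_fitting_type, slack) are hypothetical and not part of any Beam API; the sketch only assumes the usual signed two's-complement ranges.

```python
# Hypothetical table of fixed-width signed integer domains (inclusive bounds),
# ordered from narrowest to widest.
INT_DOMAINS = {
    "int8": (-2**7, 2**7 - 1),
    "int16": (-2**15, 2**15 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}


def smallest_fitting_type(lo, hi):
    """Return the narrowest type whose domain covers [lo, hi], or None."""
    for name, (type_lo, type_hi) in INT_DOMAINS.items():
        if type_lo <= lo and hi <= type_hi:
            return name
    return None


def slack(lo, hi):
    """Count values the chosen type admits that the natural domain does not."""
    name = smallest_fitting_type(lo, hi)
    if name is None:
        return None
    type_lo, type_hi = INT_DOMAINS[name]
    return (type_hi - type_lo + 1) - (hi - lo + 1)
```

A percentage field with natural domain [0, 100] fits in int8, but int8 still admits many values outside that domain, so the fixed-width type alone does not capture the data's real constraints.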

> *Columnar/Arrow*: making sure we unlock the ability to take this path is 
> paramount. So tying it directly to a row-oriented coder seems 
> counterproductive.

I don't think Coders are necessarily row-oriented. They are, however,
bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
overlap between what Coders express in terms of element typing
information and what Schemas express, and I'd rather have one concept
if possible. Or have a clear division of responsibilities.

> *Multimap*: what does it add over an array-valued map or 
> large-iterable-valued map? (honest question, not rhetorical)

Multimap has a different notion of what it means to contain a value,
can handle (unordered) unions of non-disjoint keys, etc. Maybe this
isn't worth a new primitive type.
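
To make the difference concrete, here is a minimal sketch in Python. The Multimap class and its method names are hypothetical, written only for illustration, and are not an existing Beam type.

```python
from collections import defaultdict


class Multimap:
    """Toy multimap: one key may map to several values (hypothetical)."""

    def __init__(self, pairs=()):
        self._data = defaultdict(list)
        for key, value in pairs:
            self._data[key].append(value)

    def items(self):
        """Return all (key, value) pairs, one entry per value."""
        return [(k, v) for k, vs in self._data.items() for v in vs]

    def contains(self, key, value):
        """Containment is about (key, value) pairs, not whole value lists."""
        return value in self._data.get(key, [])

    def union(self, other):
        """Union keeps every value, even when both sides share a key."""
        return Multimap(self.items() + other.items())


left = Multimap([("a", 1), ("b", 2)])
right = Multimap([("a", 3)])
merged = left.union(right)

# With an array-valued map, a plain dict merge replaces the shared key's list.
array_valued = {**{"a": [1], "b": [2]}, **{"a": [3]}}
```

Here merged holds both ("a", 1) and ("a", 3), while the array-valued merge keeps only [3] for key "a" unless the caller writes the list concatenation by hand.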

> *URN/enum for type names*: I see the case for both. The core types are 
> fundamental enough they should never really change - after all, proto, 
> thrift, avro, arrow, have addressed this (not to mention most programming 
> languages). Maybe additions once every few years. I prefer the smallest 
> intersection of these schema languages. A oneof is more clear, while URN 
> emphasizes the similarity of built-in and logical types.

Hmm... Do we have any examples of the multi-level primitive/logical
type in any of these other systems? I have a bias towards all types
being on the same footing unless there is a compelling reason to divide
things into primitive/user-defined ones.

Here it seems like the most essential value of the primitive type set
is to describe the underlying representation, for encoding elements in
a variety of ways (notably columnar, but also interfacing with other
external systems like IOs). Perhaps, rather than the previous
suggestion of making everything a logical type of bytes, this could be made
clear by still making everything a logical type, but renaming
"TypeName" to "Representation". There would be URNs (typically with
empty payloads) for the various primitive types (whose mapping to
their representations would be the identity).
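
A minimal sketch of that shape in Python, under stated assumptions: the LogicalType class, its field names, and the "example:type:..." URNs are all hypothetical and are not the proto definition under discussion. A representation of None stands for the identity mapping of a primitive type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalType:
    """Every type is a logical type: a URN, a payload, and a representation."""

    urn: str
    payload: bytes = b""  # typically empty for primitive types
    # None means the type is represented as itself (the identity mapping).
    representation: Optional["LogicalType"] = None

    def is_primitive(self):
        return self.representation is None

    def underlying(self):
        """Follow representations down to the primitive that is encoded."""
        current = self
        while current.representation is not None:
            current = current.representation
        return current


# Hypothetical URNs, for illustration only.
INT64 = LogicalType("example:type:int64")
TIMESTAMP_MILLIS = LogicalType(
    "example:type:timestamp_millis", representation=INT64
)
```

With this layout, primitive and user-defined types share one structure, and an encoder (row-oriented or columnar) only needs to understand what underlying() returns.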

- Robert
