From: Kenneth Knowles <k...@apache.org> Date: Thu, May 9, 2019 at 10:05 AM To: dev
> This is a huge development. Top posting because I can be more compact.
>
> I really think after the initial idea converges this needs a design doc with
> goals and alternatives. It is an extraordinarily consequential model change.
> So in the spirit of doing the work / bias towards action, I created a quick
> draft at https://s.apache.org/beam-schemas and added everyone on this thread
> as editors. I am still in the process of writing this to match the thread.

Thanks! Added some comments there.

> *Multiple timestamp resolutions*: you can use logical types to represent
> nanos the same way Java and proto do.

As per the other discussion, I'm unsure the value in supporting
multiple timestamp resolutions is high enough to outweigh the cost.

> *Why multiple int types?* The domain of values for these types are different.
> For a language with one "int" or "number" type, that's another domain of
> values.

What is the value in having different domains? If your data has a
natural domain, chances are it doesn't line up exactly with one of
these. I guess it's for languages whose types have specific domains?
(There's also compactness in representation, both encoded and
in-memory, though I'm not sure that benefit is large.)

> *Columnar/Arrow*: making sure we unlock the ability to take this path is
> paramount. So tying it directly to a row-oriented coder seems
> counterproductive.

I don't think Coders are necessarily row-oriented. They are, however,
bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
overlap between what Coders express in terms of element typing
information and what Schemas express, and I'd rather have one concept
if possible. Or have a clear division of responsibilities.

> *Multimap*: what does it add over an array-valued map or
> large-iterable-valued map? (honest question, not rhetorical)

Multimap has a different notion of what it means to contain a value,
can handle (unordered) unions of non-disjoint keys, etc. Maybe this
isn't worth a new primitive type.

> *URN/enum for type names*: I see the case for both. The core types are
> fundamental enough they should never really change - after all, proto,
> thrift, avro, arrow, have addressed this (not to mention most programming
> languages). Maybe additions once every few years. I prefer the smallest
> intersection of these schema languages. A oneof is more clear, while URN
> emphasizes the similarity of built-in and logical types.

Hmm... Do we have any examples of the multi-level primitive/logical
type in any of these other systems? I have a bias towards all types
being on the same footing unless there is a compelling reason to divide
things into primitive and user-defined ones.

Here it seems like the most essential value of the primitive type set
is to describe the underlying representation, for encoding elements in
a variety of ways (notably columnar, but also interfacing with other
external systems like IOs). Perhaps, rather than the previous
suggestion of making everything a logical type of bytes, this could be
made clear by still making everything a logical type, but renaming
"TypeName" to "Representation". There would be URNs (typically with
empty payloads) for the various primitive types (whose mapping to
their representations would be the identity).

- Robert
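
To make the last suggestion concrete (every type is a logical type
over a named representation, and a primitive is a type whose mapping
to its representation is the identity), here is a rough sketch in
Python. It is an illustration only: the URN strings, the `LogicalType`
class, and the `REGISTRY` dict are all hypothetical names invented for
this example, not Beam APIs or the design in the doc.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict

# Hypothetical URNs, made up for this sketch (not real Beam URNs).
INT64_URN = "example:representation:int64:v1"
MILLIS_INSTANT_URN = "example:logical:millis_instant:v1"

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class LogicalType:
    """A type identified by URN, stored as some underlying representation."""
    urn: str
    representation: str                 # URN of the representation
    to_repr: Callable[[Any], Any]       # language value -> representation
    from_repr: Callable[[Any], Any]     # representation -> language value
    payload: bytes = b""                # typically empty for primitives


def primitive(urn: str) -> LogicalType:
    # A "primitive" is its own representation; both mappings are identity.
    return LogicalType(urn, urn, lambda v: v, lambda v: v)


REGISTRY: Dict[str, LogicalType] = {}


def register(t: LogicalType) -> LogicalType:
    REGISTRY[t.urn] = t
    return t


INT64 = register(primitive(INT64_URN))

# A user-defined type on the same footing: a timestamp stored as int64
# milliseconds since the Unix epoch.
MILLIS_INSTANT = register(LogicalType(
    urn=MILLIS_INSTANT_URN,
    representation=INT64_URN,
    to_repr=lambda dt: (dt - EPOCH) // timedelta(milliseconds=1),
    from_repr=lambda ms: EPOCH + timedelta(milliseconds=ms),
))
```

With this shape, an encoder (row-oriented or columnar) would only need
to understand the set of representations, while built-in and
user-defined types are looked up the same way by URN.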