Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Thu, 09 May 2019 13:41:02 -0700

From: Reuven Lax <re...@google.com>
Date: Thu, May 9, 2019 at 7:29 PM
To: dev


> Also in the future we might be able to do optimizations at the runner level 
> if at the portability layer we understood schemas instead of just raw coders. 
> This could be things like only parsing a subset of a row (if we know only a 
> few fields are accessed) or using a columnar data structure like Arrow to 
> encode batches of rows across portability. This doesn't affect data semantics 
> of course, but having a richer, more-expressive type system opens up other 
> opportunities.

But we could do all of that with a RowCoder we understood to designate
the type(s), right?
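
As a rough illustration of that point, here is a minimal sketch in plain
Python of a row coder that carries field types, so a decoder can skip the
fields it does not need. This is not Beam's RowCoder or its wire format: the
type names, the fixed-width/length-prefixed encoding, and the helpers
encode_row and decode_fields are all assumptions made up for this example.

```python
import struct
from typing import Dict, List, Set, Tuple

# Hypothetical type names for this sketch; not Beam's actual type set.
INT64 = "int64"
STRING = "string"

# A schema here is just an ordered list of (field name, type name) pairs.
Schema = List[Tuple[str, str]]


def encode_row(schema: Schema, row: Dict[str, object]) -> bytes:
    """Encode fields in schema order.

    int64 -> 8 big-endian bytes; string -> 4-byte length prefix + UTF-8 bytes.
    """
    out = bytearray()
    for name, type_name in schema:
        value = row[name]
        if type_name == INT64:
            out += struct.pack(">q", value)
        elif type_name == STRING:
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
        else:
            raise ValueError(f"unsupported type: {type_name}")
    return bytes(out)


def decode_fields(schema: Schema, data: bytes, wanted: Set[str]) -> Dict[str, object]:
    """Decode only the wanted fields, skipping over the bytes of the others."""
    result: Dict[str, object] = {}
    pos = 0
    for name, type_name in schema:
        if type_name == INT64:
            if name in wanted:
                result[name] = struct.unpack_from(">q", data, pos)[0]
            pos += 8
        elif type_name == STRING:
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            if name in wanted:
                result[name] = data[pos:pos + length].decode("utf-8")
            pos += length  # unwanted strings are skipped without decoding
        else:
            raise ValueError(f"unsupported type: {type_name}")
    return result


schema = [("id", INT64), ("name", STRING), ("count", INT64)]
encoded = encode_row(schema, {"id": 7, "name": "beam", "count": -3})
subset = decode_fields(schema, encoded, {"count"})
```

Because the field types travel with the coder, the decoder can step over
"id" and "name" without materializing them. Whether that information lives in
a coder or in a schema is exactly the question under discussion.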

> On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On the flip side, Schemas are equivalent to the space of Coders with
>> the addition of a RowCoder and the ability to materialize to something
>> other than bytes, right? (Perhaps I'm missing something big here...)
>> This may make a backwards-compatible transition easier. (SDK-side, the
>> ability to reason about and operate on such types is of course much
>> richer than anything Coders offer right now.)
>>
>> From: Reuven Lax <re...@google.com>
>> Date: Thu, May 9, 2019 at 4:52 PM
>> To: dev
>>
>> > FYI I can imagine a world in which we have no coders. We could define the 
>> > entire model on top of schemas. Today's "Coder" is completely equivalent 
>> > to a single-field schema with a logical-type field (actually the latter is 
>> > slightly more expressive as you aren't forced to serialize into bytes).
>> >
>> > Due to compatibility constraints and the effort that would be involved in 
>> > such a change, I think the practical decision should be for schemas and 
>> > coders to coexist for the time being. However when we start planning Beam 
>> > 3.0, deprecating coders is something I would like to suggest.
>> >
>> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>
>> >> From: Kenneth Knowles <k...@apache.org>
>> >> Date: Thu, May 9, 2019 at 10:05 AM
>> >> To: dev
>> >>
>> >> > This is a huge development. Top posting because I can be more compact.
>> >> >
>> >> > I really think after the initial idea converges this needs a design doc 
>> >> > with goals and alternatives. It is an extraordinarily consequential 
>> >> > model change. So in the spirit of doing the work / bias towards action, 
>> >> > I created a quick draft at https://s.apache.org/beam-schemas and added 
>> >> > everyone on this thread as editors. I am still in the process of 
>> >> > writing this to match the thread.
>> >>
>> >> Thanks! Added some comments there.
>> >>
>> >> > *Multiple timestamp resolutions*: you can use logical types to 
>> >> > represent nanos the same way Java and proto do.
>> >>
>> >> As per the other discussion, I'm unsure the value in supporting
>> >> multiple timestamp resolutions is high enough to outweigh the cost.
>> >>
>> >> > *Why multiple int types?* The domains of values for these types are 
>> >> > different. For a language with one "int" or "number" type, that's 
>> >> > another domain of values.
>> >>
>> >> What is the value in having different domains? If your data has a
>> >> natural domain, chances are it doesn't line up exactly with one of
>> >> these. I guess it's for languages whose types have specific domains?
>> >> (There's also compactness in representation, encoded and in-memory,
>> >> though I'm not sure that's high.)
>> >>
>> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path 
>> >> > is paramount. So tying it directly to a row-oriented coder seems 
>> >> > counterproductive.
>> >>
>> >> I don't think Coders are necessarily row-oriented. They are, however,
>> >> bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
>> >> overlap between what Coders express in terms of element typing
>> >> information and what Schemas express, and I'd rather have one concept
>> >> if possible. Or have a clear division of responsibilities.
>> >>
>> >> > *Multimap*: what does it add over an array-valued map or 
>> >> > large-iterable-valued map? (honest question, not rhetorical)
>> >>
>> >> Multimap has a different notion of what it means to contain a value,
>> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe this
>> >> isn't worth a new primitive type.
>> >>
>> >> > *URN/enum for type names*: I see the case for both. The core types are 
>> >> > fundamental enough they should never really change - after all, proto, 
>> >> > thrift, avro, arrow, have addressed this (not to mention most 
>> >> > programming languages). Maybe additions once every few years. I prefer 
>> >> > the smallest intersection of these schema languages. A oneof is more 
>> >> > clear, while URN emphasizes the similarity of built-in and logical 
>> >> > types.
>> >>
>> >> Hmm... Do we have any examples of the multi-level primitive/logical
>> >> type in any of these other systems? I have a bias towards all types
>> >> being on the same footing unless there is a compelling reason to divide
>> >> things into primitive/user-defined ones.
>> >>
>> >> Here it seems like the most essential value of the primitive type set
>> >> is to describe the underlying representation, for encoding elements in
>> >> a variety of ways (notably columnar, but also interfacing with other
>> >> external systems like IOs). Perhaps, rather than the previous
>> >> suggestion of making everything a logical of bytes, this could be made
>> >> clear by still making everything a logical type, but renaming
>> >> "TypeName" to Representation. There would be URNs (typically with
>> >> empty payloads) for the various primitive types (whose mapping to
>> >> their representations would be the identity).
>> >>
>> >> - Robert
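
One way to picture the "everything is a logical type over a Representation"
idea from the message above is the following minimal Python sketch. It is not
a proposed proto or any existing Beam API: the class names, the URN strings,
and the nanosecond timestamp example are hypothetical and only illustrate
that primitive types can be logical types whose mapping to their
representation is the identity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


def _identity(value: Any) -> Any:
    return value


class Representation(Enum):
    """How values are physically represented (hypothetical, small subset)."""
    INT64 = "int64"
    STRING = "string"
    BYTES = "bytes"


@dataclass(frozen=True)
class LogicalType:
    """A type identified by URN, with a mapping to and from its representation."""
    urn: str
    representation: Representation
    payload: bytes = b""  # typically empty for primitive types
    to_representation: Callable[[Any], Any] = _identity
    from_representation: Callable[[Any], Any] = _identity


# Primitive types: the mapping to the representation is the identity.
# The URNs below are invented for this example.
INT64_TYPE = LogicalType("example:type:int64:v1", Representation.INT64)
STRING_TYPE = LogicalType("example:type:string:v1", Representation.STRING)

NANOS_PER_SECOND = 1_000_000_000

# A non-primitive type on the same footing: a (seconds, nanos) timestamp
# stored as a single int64 count of nanoseconds.
NANOS_TIMESTAMP = LogicalType(
    urn="example:type:nanos_timestamp:v1",
    representation=Representation.INT64,
    to_representation=lambda ts: ts[0] * NANOS_PER_SECOND + ts[1],
    from_representation=lambda n: divmod(n, NANOS_PER_SECOND),
)

stored = NANOS_TIMESTAMP.to_representation((2, 5))
restored = NANOS_TIMESTAMP.from_representation(stored)
```

In this sketch a runner that only understands Representation can still move
and encode NANOS_TIMESTAMP values as int64, while an SDK that recognizes the
URN can convert them back to a richer type.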
