Re: [DISCUSS] Portability representation of schemas

Reuven Lax Thu, 09 May 2019 10:32:12 -0700

Also in the future we might be able to do optimizations at the runner level
if at the portability layer we understood schemes instead of just raw
coders. This could be things like only parsing a subset of a row (if we
know only a few fields are accessed) or using a columnar data structure
like Arrow to encode batches of rows across portability. This doesn't
affect data semantics of course, but having a richer, more-expressive type
system opens up other opportunities.


On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <[email protected]> wrote:

> On the flip side, Schemas are equivalent to the space of Coders with
> the addition of a RowCoder and the ability to materialize to something
> other than bytes, right? (Perhaps I'm missing something big here...)
> This may make a backwards-compatible transition easier. (SDK-side, the
> ability to reason about and operate on such types is of course much
> richer than anything Coders offer right now.)
>
> From: Reuven Lax <[email protected]>
> Date: Thu, May 9, 2019 at 4:52 PM
> To: dev
>
> > FYI I can imagine a world in which we have no coders. We could define
> the entire model on top of schemas. Today's "Coder" is completely
> equivalent to a single-field schema with a logical-type field (actually the
> latter is slightly more expressive as you aren't forced to serialize into
> bytes).
> >
> > Due to compatibility constraints and the effort that would be  involved
> in such a change, I think the practical decision should be for schemas and
> coders to coexist for the time being. However when we start planning Beam
> 3.0, deprecating coders is something I would like to suggest.
> >
> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> From: Kenneth Knowles <[email protected]>
> >> Date: Thu, May 9, 2019 at 10:05 AM
> >> To: dev
> >>
> >> > This is a huge development. Top posting because I can be more compact.
> >> >
> >> > I really think after the initial idea converges this needs a design
> doc with goals and alternatives. It is an extraordinarily consequential
> model change. So in the spirit of doing the work / bias towards action, I
> created a quick draft at https://s.apache.org/beam-schemas and added
> everyone on this thread as editors. I am still in the process of writing
> this to match the thread.
> >>
> >> Thanks! Added some comments there.
> >>
> >> > *Multiple timestamp resolutions*: you can use logcial types to
> represent nanos the same way Java and proto do.
> >>
> >> As per the other discussion, I'm unsure the value in supporting
> >> multiple timestamp resolutions is high enough to outweigh the cost.
> >>
> >> > *Why multiple int types?* The domain of values for these types are
> different. For a language with one "int" or "number" type, that's another
> domain of values.
> >>
> >> What is the value in having different domains? If your data has a
> >> natural domain, chances are it doesn't line up exactly with one of
> >> these. I guess it's for languages whose types have specific domains?
> >> (There's also compactness in representation, encoded and in-memory,
> >> though I'm not sure that's high.)
> >>
> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path
> is Paramount. So tying it directly to a row-oriented coder seems
> counterproductive.
> >>
> >> I don't think Coders are necessarily row-oriented. They are, however,
> >> bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
> >> overlap between what Coders express in terms of element typing
> >> information and what Schemas express, and I'd rather have one concept
> >> if possible. Or have a clear division of responsibilities.
> >>
> >> > *Multimap*: what does it add over an array-valued map or
> large-iterable-valued map? (honest question, not rhetorical)
> >>
> >> Multimap has a different notion of what it means to contain a value,
> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe this
> >> isn't worth a new primitive type.
> >>
> >> > *URN/enum for type names*: I see the case for both. The core types
> are fundamental enough they should never really change - after all, proto,
> thrift, avro, arrow, have addressed this (not to mention most programming
> languages). Maybe additions once every few years. I prefer the smallest
> intersection of these schema languages. A oneof is more clear, while URN
> emphasizes the similarity of built-in and logical types.
> >>
> >> Hmm... Do we have any examples of the multi-level primitive/logical
> >> type in any of these other systems? I have a bias towards all types
> >> being on the same footing unless there is compelling reason to divide
> >> things into primitive/use-defined ones.
> >>
> >> Here it seems like the most essential value of the primitive type set
> >> is to describe the underlying representation, for encoding elements in
> >> a variety of ways (notably columnar, but also interfacing with other
> >> external systems like IOs). Perhaps, rather than the previous
> >> suggestion of making everything a logical of bytes, this could be made
> >> clear by still making everything a logical type, but renaming
> >> "TypeName" to Representation. There would be URNs (typically with
> >> empty payloads) for the various primitive types (whose mapping to
> >> their representations would be the identity).
> >>
> >> - Robert
>

Re: [DISCUSS] Portability representation of schemas

Reply via email to