*tl;dr:* SchemaCoder effectively defines a logical type with a base type of
Row, and we should take that into account in the portable representation.

I'm a little concerned that the current proposals for a portable
representation don't actually fully represent Schemas. It seems to me that
the current Java-only Schemas are made up of three concepts that are
intertwined:
(a) The Java SDK specific code for schema inference, type coercion, and
"schema-aware" transforms.
(b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
(c) A SchemaCoder[4] that has a RowCoder for a particular schema, and
functions for converting Rows with that schema to/from a Java type T. Those
functions and the RowCoder are then composed to provide a Coder for the
type T.
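
To make (c) concrete, here's a rough sketch of that composition. This is
simplified and hypothetical - the class and names below are mine, not the
actual SchemaCoder code:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.beam.sdk.coders.CustomCoder;
  import org.apache.beam.sdk.coders.RowCoder;
  import org.apache.beam.sdk.transforms.SerializableFunction;
  import org.apache.beam.sdk.values.Row;

  // Hypothetical sketch of SchemaCoder's shape: a RowCoder for one fixed Schema,
  // plus two functions converting a Java type T to/from Rows with that Schema.
  class ToFromRowCoder<T> extends CustomCoder<T> {
    private final RowCoder rowCoder;                     // encodes Rows with the Schema
    private final SerializableFunction<T, Row> toRow;    // T -> Row
    private final SerializableFunction<Row, T> fromRow;  // Row -> T

    ToFromRowCoder(
        RowCoder rowCoder,
        SerializableFunction<T, Row> toRow,
        SerializableFunction<Row, T> fromRow) {
      this.rowCoder = rowCoder;
      this.toRow = toRow;
      this.fromRow = fromRow;
    }

    @Override
    public void encode(T value, OutputStream out) throws IOException {
      rowCoder.encode(toRow.apply(value), out);   // convert to a Row, then delegate
    }

    @Override
    public T decode(InputStream in) throws IOException {
      return fromRow.apply(rowCoder.decode(in));  // decode a Row, then convert back
    }
  }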

We're not concerned with (a) at this time, since that's specific to each SDK
rather than the interface between SDKs. My understanding is we just want to define
a portable representation for (b) and/or (c).

What has been discussed so far is really just a portable representation for
(b), the RowCoder, since the discussion is only around how to represent the
schema itself and not the to/from functions. One of the outstanding
questions for that schema representation is how to represent logical types,
which may or may not have a corresponding language type in each SDK (the
canonical example being a timestamp type with seconds and nanos, mapped to
java.time.Instant in Java). I think this question is critically important, because
(c), the SchemaCoder, is actually *defining a logical type* with a language
type T in the Java SDK. This becomes clear when you compare SchemaCoder[4]
to the Schema.LogicalType interface[5] - both essentially have three
attributes: a base type, and two functions for converting to/from that base
type. The only difference is that for SchemaCoder the base type must be a Row
so it can be represented by a Schema alone, while LogicalType can have any
base type that can be represented by FieldType, including a Row.
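
To illustrate the parallel, here's what a logical type with a Row base type
might look like - a hypothetical sketch with a made-up LatLng class, and with
the LogicalType method names approximated rather than copied from [5]:

  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.schemas.Schema.FieldType;
  import org.apache.beam.sdk.schemas.Schema.LogicalType;
  import org.apache.beam.sdk.values.Row;

  // Made-up Java type being mapped to a schema.
  class LatLng {
    final double lat, lng;
    LatLng(double lat, double lng) { this.lat = lat; this.lng = lng; }
  }

  // A logical type whose base type is a Row: a base type plus two conversion
  // functions, which is the same shape SchemaCoder has (schema + toRow/fromRow).
  class LatLngType implements LogicalType<LatLng, Row> {
    private static final Schema BASE =
        Schema.builder().addDoubleField("lat").addDoubleField("lng").build();

    @Override public String getIdentifier() { return "example:latlng"; }  // made-up id
    @Override public FieldType getBaseType() { return FieldType.row(BASE); }
    @Override public Row toBaseType(LatLng input) {
      return Row.withSchema(BASE).addValues(input.lat, input.lng).build();
    }
    @Override public LatLng toInputType(Row base) {
      return new LatLng(base.getDouble("lat"), base.getDouble("lng"));
    }
    // (other LogicalType methods, if any, omitted)
  }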

I think it may make sense to fully embrace this duality, by letting
SchemaCoder have a baseType other than just Row and renaming it to
LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware
transforms (a) would operate only on LogicalTypeCoders with a Row base
type. Perhaps some of the current schema logic could also be applied more
generally to any logical type - for example, to provide type coercion for
logical types with a base type other than Row, like int64 and a timestamp
class backed by millis, or fixed-size bytes and a UUID class. And having a
portable representation that represents those (non-Row-backed) logical
types with some URN would also allow us to pass them to other languages
without unnecessarily wrapping them in a Row in order to use SchemaCoder.
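
As a sketch of what one of those non-Row-backed logical types could look like
(again hypothetical - the URN and class are made up, and the LogicalType
method names are approximate):

  import org.apache.beam.sdk.schemas.Schema.FieldType;
  import org.apache.beam.sdk.schemas.Schema.LogicalType;
  import org.joda.time.Instant;

  // A millis-precision timestamp whose base type is INT64. With a URN for this in
  // the portable representation, another SDK could map the same int64 base type to
  // its own timestamp class without wrapping the value in a single-field Row.
  class MillisInstantType implements LogicalType<Instant, Long> {
    @Override public String getIdentifier() { return "example:millis_instant"; }  // made-up URN
    @Override public FieldType getBaseType() { return FieldType.INT64; }
    @Override public Long toBaseType(Instant input) { return input.getMillis(); }
    @Override public Instant toInputType(Long base) { return new Instant(base); }
    // (other LogicalType methods, if any, omitted)
  }

A UUID class over fixed-size bytes would look analogous.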

I've added a section to the doc [6] to propose this alternative in the
context of the portable representation, but I wanted to bring it up here as
well to solicit feedback.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
[3]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
[4]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
[5]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
[6]
https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin

On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com> wrote:

> Ah thanks! I added some language there.
>
> *From: *Kenneth Knowles <k...@apache.org>
> *Date: *Thu, May 9, 2019 at 5:31 PM
> *To: *dev
>
>
>> *From: *Brian Hulette <bhule...@google.com>
>> *Date: *Thu, May 9, 2019 at 2:02 PM
>> *To: * <dev@beam.apache.org>
>>
>> We briefly discussed using arrow schemas in place of beam schemas
>>> entirely in an arrow thread [1]. The biggest reason not to do this was that we
>>> wanted to have a type for large iterables in beam schemas. But given that
>>> large iterables aren't currently implemented, beam schemas look very
>>> similar to arrow schemas.
>>>
>>
>>
>>> I think it makes sense to take inspiration from arrow schemas where
>>> possible, and maybe even copy them outright. Arrow already has a portable
>>> (flatbuffers) schema representation [2], and implementations for it in many
>>> languages that we may be able to re-use as we bring schemas to more SDKs
>>> (the project has Python and Go implementations). There are a couple of
>>> concepts in Arrow schemas that are specific to the format and wouldn't
>>> make sense for us (fields can indicate whether or not they are dictionary
>>> encoded, and the schema has an endianness field), but if you drop those
>>> concepts the arrow spec looks pretty similar to the beam proto spec.
>>>
>>
>> FWIW I left a blank section in the doc for filling out what the
>> differences are and why, and conversely what the interop opportunities may
>> be. Such sections are some of my favorite sections of design docs.
>>
>> Kenn
>>
>>
>> Brian
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>
>>> *From: *Robert Bradshaw <rober...@google.com>
>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>> *To: *dev
>>>
>>> From: Reuven Lax <re...@google.com>
>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>> To: dev
>>>>
>>>> > Also in the future we might be able to do optimizations at the runner
>>>> level if at the portability layer we understood schemas instead of just raw
>>>> coders. This could be things like only parsing a subset of a row (if we
>>>> know only a few fields are accessed) or using a columnar data structure
>>>> like Arrow to encode batches of rows across portability. This doesn't
>>>> affect data semantics of course, but having a richer, more-expressive type
>>>> system opens up other opportunities.
>>>>
>>>> But we could do all of that with a RowCoder we understood to designate
>>>> the type(s), right?
>>>>
>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>> >>
>>>> >> On the flip side, Schemas are equivalent to the space of Coders with
>>>> >> the addition of a RowCoder and the ability to materialize to
>>>> something
>>>> >> other than bytes, right? (Perhaps I'm missing something big here...)
>>>> >> This may make a backwards-compatible transition easier. (SDK-side,
>>>> the
>>>> >> ability to reason about and operate on such types is of course much
>>>> >> richer than anything Coders offer right now.)
>>>> >>
>>>> >> From: Reuven Lax <re...@google.com>
>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>> >> To: dev
>>>> >>
>>>> >> > FYI I can imagine a world in which we have no coders. We could
>>>> define the entire model on top of schemas. Today's "Coder" is completely
>>>> equivalent to a single-field schema with a logical-type field (actually the
>>>> latter is slightly more expressive as you aren't forced to serialize into
>>>> bytes).
>>>> >> >
>>>> >> > Due to compatibility constraints and the effort that would be
>>>> involved in such a change, I think the practical decision should be for
>>>> schemas and coders to coexist for the time being. However when we start
>>>> planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>> >> >
>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <
>>>> rober...@google.com> wrote:
>>>> >> >>
>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>> >> >> To: dev
>>>> >> >>
>>>> >> >> > This is a huge development. Top posting because I can be more
>>>> compact.
>>>> >> >> >
>>>> >> >> > I really think after the initial idea converges this needs a
>>>> design doc with goals and alternatives. It is an extraordinarily
>>>> consequential model change. So in the spirit of doing the work / bias
>>>> towards action, I created a quick draft at
>>>> https://s.apache.org/beam-schemas and added everyone on this thread as
>>>> editors. I am still in the process of writing this to match the thread.
>>>> >> >>
>>>> >> >> Thanks! Added some comments there.
>>>> >> >>
>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to
>>>> represent nanos the same way Java and proto do.
>>>> >> >>
>>>> >> >> As per the other discussion, I'm unsure the value in supporting
>>>> >> >> multiple timestamp resolutions is high enough to outweigh the
>>>> cost.
>>>> >> >>
>>>> >> >> > *Why multiple int types?* The domain of values for these types
>>>> are different. For a language with one "int" or "number" type, that's
>>>> another domain of values.
>>>> >> >>
>>>> >> >> What is the value in having different domains? If your data has a
>>>> >> >> natural domain, chances are it doesn't line up exactly with one of
>>>> >> >> these. I guess it's for languages whose types have specific
>>>> domains?
>>>> >> >> (There's also compactness in representation, encoded and
>>>> in-memory,
>>>> >> >> though I'm not sure that's high.)
>>>> >> >>
>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take
>>>> this path is paramount. So tying it directly to a row-oriented coder seems
>>>> counterproductive.
>>>> >> >>
>>>> >> >> I don't think Coders are necessarily row-oriented. They are,
>>>> however,
>>>> >> >> bytes-oriented. (Perhaps they need not be.) There seems to be a
>>>> lot of
>>>> >> >> overlap between what Coders express in terms of element typing
>>>> >> >> information and what Schemas express, and I'd rather have one
>>>> concept
>>>> >> >> if possible. Or have a clear division of responsibilities.
>>>> >> >>
>>>> >> >> > *Multimap*: what does it add over an array-valued map or
>>>> large-iterable-valued map? (honest question, not rhetorical)
>>>> >> >>
>>>> >> >> Multimap has a different notion of what it means to contain a
>>>> value,
>>>> >> >> can handle (unordered) unions of non-disjoint keys, etc. Maybe
>>>> this
>>>> >> >> isn't worth a new primitive type.
>>>> >> >>
>>>> >> >> > *URN/enum for type names*: I see the case for both. The core
>>>> types are fundamental enough they should never really change - after all,
>>>> proto, thrift, avro, arrow, have addressed this (not to mention most
>>>> programming languages). Maybe additions once every few years. I prefer the
>>>> smallest intersection of these schema languages. A oneof is more clear,
>>>> while URN emphasizes the similarity of built-in and logical types.
>>>> >> >>
>>>> >> >> Hmm... Do we have any examples of the multi-level
>>>> primitive/logical
>>>> >> >> type in any of these other systems? I have a bias towards all
>>>> types
>>>> >> >> being on the same footing unless there is compelling reason to
>>>> divide
>>>> >> >> things into primitive/user-defined ones.
>>>> >> >>
>>>> >> >> Here it seems like the most essential value of the primitive type
>>>> set
>>>> >> >> is to describe the underlying representation, for encoding
>>>> elements in
>>>> >> >> a variety of ways (notably columnar, but also interfacing with
>>>> other
>>>> >> >> external systems like IOs). Perhaps, rather than the previous
>>>> >> >> suggestion of making everything a logical of bytes, this could be
>>>> made
>>>> >> >> clear by still making everything a logical type, but renaming
>>>> >> >> "TypeName" to Representation. There would be URNs (typically with
>>>> >> >> empty payloads) for the various primitive types (whose mapping to
>>>> >> >> their representations would be the identity).
>>>> >> >>
>>>> >> >> - Robert
>>>>
>>>
