> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
Maybe. Right now the proposed LogicalType message is pretty simple/generic:

message LogicalType {
  FieldType representation = 1;
  string logical_urn = 2;
  bytes logical_payload = 3;
}

If we keep just logical_urn and logical_payload, the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions.

Or, alternatively, we could have a generalization of the SchemaRegistry for logical types. Implementations for standard types and user-defined types would be registered by URN, and the SDK could look them up given just a URN. I put a brief section about this alternative in the doc last week [1]. What I suggested there included removing the logical_payload field, which is probably overkill. The critical piece is just relying on a registry in the SDK to look up types and to/from functions rather than storing them in the portable schema itself.

I kind of like keeping the LogicalType message generic for now, since it gives us a way to try out these various approaches, but maybe that's just a cop-out.

[1] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy

On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> wrote:

> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <bhule...@google.com> wrote:
>
>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> wrote:
>>
>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <bhule...@google.com> wrote:
>>>
>>>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row, and we should think about that.
>>>>
>>>> I'm a little concerned that the current proposals for a portable representation don't actually fully represent Schemas. It seems to me that the current Java-only Schemas are made up of three intertwined concepts:
>>>> (a) The Java SDK-specific code for schema inference, type coercion, and "schema-aware" transforms.
>>>> (b) A RowCoder [1] that encodes Rows [2] which have a particular Schema [3].
>>>> (c) A SchemaCoder [4] that has a RowCoder for a particular schema, and functions for converting Rows with that schema to/from a Java type T. Those functions and the RowCoder are then composed to provide a Coder for the type T.
>>>
>>> RowCoder is currently just an internal implementation detail; it can be eliminated. SchemaCoder is the only thing that determines a schema today.
>>
>> Why not keep it around? I think it would make sense to have a RowCoder implementation in every SDK, as well as something like SchemaCoder that defines a conversion from that SDK's "Row" to the language type.
>
> The point is that from a programmer's perspective, there is nothing much special about Row. Any type can have a schema, and the only special thing about Row is that it's always guaranteed to exist. From that standpoint, Row is nearly an implementation detail. Today RowCoder is never set on _any_ PCollection; it's literally just used as a helper library, so there's no real need for it to exist as a "Coder."
>
>>>> We're not concerned with (a) at this time, since that's specific to the SDK, not the interface between them. My understanding is we just want to define a portable representation for (b) and/or (c).
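To make the (b)/(c) distinction concrete, here is a minimal sketch of what portable payloads for each might look like, assuming hypothetical message and field names (Schema and FieldType refer to the messages in the proposed schema representation):

  // A hedged sketch, not a concrete proposal.
  // (b): a portable RowCoder is fully described by its schema.
  message RowCoderPayload {
    Schema schema = 1;
  }

  // (c): a portable SchemaCoder would additionally need the language
  // type and the to/from functions, e.g. serialized by the Java SDK
  // into opaque blobs, per the logical_payload idea above.
  message SchemaCoderPayload {
    Schema schema = 1;
    bytes serialized_class = 2;    // the language type T
    bytes to_row_function = 3;     // serialized T -> Row function
    bytes from_row_function = 4;   // serialized Row -> T function
  }

Under the registry alternative, the serialized blobs would stay out of the proto entirely and the SDK would resolve a URN to a registered implementation instead.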
>>>>
>>>> What has been discussed so far is really just a portable representation for (b), the RowCoder, since the discussion is only around how to represent the schema itself and not the to/from functions.
>>>
>>> Correct. The to/from functions are actually related to (a). One of the big goals of schemas was that users should not be forced to operate on rows to get schemas. A user can create PCollection<MyRandomType>, and as long as the SDK can infer a schema from MyRandomType, the user never needs to even see a Row object. The to/fromRow functions are what make this work today.
>>
>> One of the points I'd like to make is that this type coercion is a useful concept on its own, separate from schemas. It's especially useful for a type that has a schema and is encoded by RowCoder, since that can represent many more types, but the type coercion doesn't have to be tied to just schemas and RowCoder. We could also do type coercion for types that are effectively wrappers around an integer or a string. It could just be a general way to map language types to base types (i.e. types that we have a coder for). Then it just becomes a general framework for extending coders to represent more language types.
>
> Let's not tie those conversations together. Maybe a similar concept will hold true for general coders (or we might decide to get rid of coders in favor of schemas, in which case that becomes moot), but I don't think we should prematurely generalize.
>
>>>> One of the outstanding questions for that schema representation is how to represent logical types, which may or may not have some language type in each SDK (the canonical example being a timestamp type with seconds and nanos and java.time.Instant). I think this question is critically important, because (c), the SchemaCoder, is actually *defining a logical type* with a language type T in the Java SDK. This becomes clear when you compare SchemaCoder [4] to the Schema.LogicalType interface [5] - both essentially have three attributes: a base type, and two functions for converting to/from that base type. The only difference is that for SchemaCoder the base type must be a Row so it can be represented by a Schema alone, while LogicalType can have any base type that can be represented by FieldType, including a Row.
>>>
>>> This is not true actually. SchemaCoder can have any base type; that's why (in Java) it's SchemaCoder<T>. This is why PCollection<T> can have a schema, even if T is not Row.
>>
>> I'm not sure I effectively communicated what I meant - when I said SchemaCoder's "base type" I wasn't referring to T, I was referring to the base FieldType, whose coder we use for this type. I meant "base type" to be analogous to LogicalType's `getBaseType`, or what Kenn is suggesting we call "representation" in the portable Beam schemas doc. To define some terms from my original message:
>> base type = an instance of FieldType; crucially, this is something that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
>> language type (or "T", "type T", "logical type") = some Java class (or something analogous in the other SDKs) that we may or may not have a coder for. It's possible to define functions for converting instances of the language type to/from the base type.
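As a concrete illustration of these two terms, here is a hedged sketch (in text proto form, with a made-up URN and field names that only loosely follow the draft proto) of a logical type whose base type is INT64 and whose language type is java.time.Instant:

  # Hypothetical instance of the proposed LogicalType message.
  logical_type {
    representation { type_name: INT64 }                 # the base type
    logical_urn: "beam:logical_type:millis_instant:v1"  # made-up URN
  }
  # The to/from functions (in Java, e.g. Instant.toEpochMilli and
  # Instant.ofEpochMilli) would live in the SDK, resolved via the URN
  # or carried in a payload.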
>>
>> I was just trying to make the case that SchemaCoder is really a special case of LogicalType, where `getBaseType` always returns a Row with the stored Schema.
>
> Yeah, I think I got that point.
>
> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
>
>> To make the point with code: SchemaCoder<T> can be made to implement Schema.LogicalType<T, Row> with trivial implementations of getBaseType, toBaseType, and toInputType (I'm not trying to say we should or shouldn't do this, just using it to illustrate my point):
>>
>> class SchemaCoder<T> extends CustomCoder<T> implements Schema.LogicalType<T, Row> {
>>   ...
>>
>>   @Override
>>   public FieldType getBaseType() {
>>     return FieldType.row(getSchema());
>>   }
>>
>>   @Override
>>   public Row toBaseType(T input) {
>>     return this.toRowFunction.apply(input);
>>   }
>>
>>   @Override
>>   public T toInputType(Row base) {
>>     return this.fromRowFunction.apply(base);
>>   }
>>   ...
>> }
>>
>>>> I think it may make sense to fully embrace this duality, by letting SchemaCoder have a baseType other than just Row and renaming it to LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware transforms (a) would operate only on LogicalTypeCoders with a Row base type. Perhaps some of the current schema logic could also be applied more generally to any logical type - for example, to provide type coercion for logical types with a base type other than Row, like int64 and a timestamp class backed by millis, or fixed-size bytes and a UUID class. And having a portable representation for those (non-Row-backed) logical types with some URN would also allow us to pass them to other languages without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>>>
>>> I think the actual overlap here is between the to/from functions in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the equivalent functionality in LogicalType. However, making all of schemas simply just a logical type feels a bit awkward and circular to me. Maybe we should refactor that part out into a LogicalTypeConversion proto, and reference that from both LogicalType and from SchemaCoder?
>>
>> LogicalType is already potentially circular, though. A schema can have a field with a logical type, and that logical type can have a base type of Row with a field with a logical type (and on and on...). To me it seems elegant, not awkward, to recognize that SchemaCoder is just a special case of this concept.
>>
>> Something like the LogicalTypeConversion proto would definitely be an improvement, but I would still prefer just using a top-level logical type :)
>>
>>>> I've added a section to the doc [6] to propose this alternative in the context of the portable representation, but I wanted to bring it up here as well to solicit feedback.
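For reference, here is a hedged sketch, with hypothetical message and field names, of the LogicalTypeConversion refactoring suggested above, referenced from both LogicalType and a portable SchemaCoder:

  // A hedged sketch of the suggested refactoring, not actual protos.
  message LogicalTypeConversion {
    bytes serialized_class = 1;     // the language type T
    bytes to_base_function = 2;     // serialized T -> base function
    bytes from_base_function = 3;   // serialized base -> T function
  }

  message LogicalType {
    FieldType representation = 1;
    string logical_urn = 2;
    LogicalTypeConversion conversion = 3;  // in place of logical_payload
  }

  // A portable SchemaCoder would reuse the same conversion message:
  message SchemaCoderPayload {
    Schema schema = 1;
    LogicalTypeConversion conversion = 2;
  }

Under the top-level-logical-type alternative preferred above, SchemaCoderPayload would disappear and a schema'd PCollection would simply carry a LogicalType whose representation is a Row.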
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>> [6] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>
>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>>> Ah thanks! I added some language there.
>>>>>
>>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>> *To: *dev
>>>>>
>>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>> *To: * <dev@beam.apache.org>
>>>>>>
>>>>>>> We briefly discussed using Arrow schemas in place of Beam schemas entirely in an Arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in Beam schemas. But given that large iterables aren't currently implemented, Beam schemas look very similar to Arrow schemas.
>>>>>>>
>>>>>>> I think it makes sense to take inspiration from Arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations). There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary-encoded, and the schema has an endianness field), but if you drop those concepts the Arrow spec looks pretty similar to the Beam proto spec.
>>>>>>
>>>>>> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>
>>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>> *To: *dev
>>>>>>>
>>>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>> To: dev
>>>>>>>>
>>>>>>>> > Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability.
>>>>>>>> > This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>>>>>>
>>>>>>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>>>>>>
>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>> >>
>>>>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>>>> >>
>>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>> >> To: dev
>>>>>>>> >>
>>>>>>>> >> > FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive, as you aren't forced to serialize into bytes).
>>>>>>>> >> >
>>>>>>>> >> > Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However, when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>>>>>> >> >
>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>> >> >> To: dev
>>>>>>>> >> >>
>>>>>>>> >> >> > This is a huge development. Top posting because I can be more compact.
>>>>>>>> >> >> >
>>>>>>>> >> >> > I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>> >> >>
>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>> >> >>
>>>>>>>> >> >> As per the other discussion, I'm unsure whether the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>> >> >>
>>>>>>>> >> >> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains?
>>>>>>>> >> >> (There's also compactness in representation, encoded and in-memory, though I'm not sure that value is high.)
>>>>>>>> >> >>
>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>> >> >>
>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>> >> >>
>>>>>>>> >> >> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>>>>> >> >>
>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while a URN emphasizes the similarity of built-in and logical types.
>>>>>>>> >> >>
>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
>>>>>>>> >> >>
>>>>>>>> >> >> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation. There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>> >> >>
>>>>>>>> >> >> - Robert
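A hedged sketch of this last suggestion, with made-up names: every type becomes a logical type over a small set of representations, and the current primitive types become well-known URNs whose mapping to their representation is the identity:

  // Hypothetical sketch only, not the actual Beam proto.
  enum Representation {   // formerly "TypeName"
    INT64 = 0;            // other primitives elided
    STRING = 1;
    BYTES = 2;
    ROW = 3;
  }

  message FieldType {
    Representation representation = 1;
    string urn = 2;       // e.g. "beam:type:int64:v1" (made up), with
                          // an empty payload, for the primitive types
    bytes payload = 3;    // arguments for non-trivial logical types
  }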