> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
Maybe. Right now the proposed LogicalType message is pretty simple/generic:

message LogicalType {
  FieldType representation = 1;
  string logical_urn = 2;
  bytes logical_payload = 3;
}

If we keep just logical_urn and logical_payload, the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions.

Or, alternatively, we could have a generalization of the SchemaRegistry for logical types. Implementations for standard types and user-defined types would be registered by URN, and the SDK could look them up given just a URN. I put a brief section about this alternative in the doc last week [1]. What I suggested there included removing the logical_payload field, which is probably overkill. The critical piece is just relying on a registry in the SDK to look up types and to/from functions rather than storing them in the portable schema itself.

I kind of like keeping the LogicalType message generic for now, since it gives us a way to try out these various approaches, but maybe that's just a cop-out.

[1] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy

On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> wrote:

> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <bhule...@google.com> wrote:
>
>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> wrote:
>>
>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <bhule...@google.com> wrote:
>>>
>>>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row, and we should think about that.
>>>>
>>>> I'm a little concerned that the current proposals for a portable representation don't actually fully represent Schemas. It seems to me that the current Java-only Schemas are made up of three intertwined concepts:
>>>> (a) The Java SDK-specific code for schema inference, type coercion, and "schema-aware" transforms.
>>>> (b) A RowCoder [1] that encodes Rows [2] which have a particular Schema [3].
>>>> (c) A SchemaCoder [4] that has a RowCoder for a particular schema, and functions for converting Rows with that schema to/from a Java type T. Those functions and the RowCoder are then composed to provide a Coder for the type T.
>>>
>>> RowCoder is currently just an internal implementation detail; it can be eliminated. SchemaCoder is the only thing that determines a schema today.
>>
>> Why not keep it around? I think it would make sense to have a RowCoder implementation in every SDK, as well as something like SchemaCoder that defines a conversion from that SDK's "Row" to the language type.
>
> The point is that from a programmer's perspective, there is nothing much special about Row. Any type can have a schema, and the only special thing about Row is that it's always guaranteed to exist. From that standpoint, Row is nearly an implementation detail. Today RowCoder is never set on _any_ PCollection; it's literally just used as a helper library, so there's no real need for it to exist as a "Coder."
>
>>>> We're not concerned with (a) at this time, since that's specific to the SDK, not the interface between them. My understanding is we just want to define a portable representation for (b) and/or (c).
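To make the (b)/(c) distinction concrete, here is a minimal sketch of what portable payloads for each might look like, assuming hypothetical message and field names (Schema and FieldType refer to the messages in the proposed schema representation):

  // A hedged sketch, not a concrete proposal.
  // (b): a portable RowCoder is fully described by its schema.
  message RowCoderPayload {
    Schema schema = 1;
  }

  // (c): a portable SchemaCoder would additionally need the language
  // type and the to/from functions, e.g. serialized by the Java SDK
  // into opaque blobs, per the logical_payload idea above.
  message SchemaCoderPayload {
    Schema schema = 1;
    bytes serialized_class = 2;    // the language type T
    bytes to_row_function = 3;     // serialized T -> Row function
    bytes from_row_function = 4;   // serialized Row -> T function
  }

Under the registry alternative, the serialized blobs would stay out of the proto entirely and the SDK would resolve a URN to a registered implementation instead.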
>>>>
>>>> What has been discussed so far is really just a portable representation for (b), the RowCoder, since the discussion is only around how to represent the schema itself and not the to/from functions.
>>>
>>> Correct. The to/from functions are actually related to (a). One of the big goals of schemas was that users should not be forced to operate on rows to get schemas. A user can create PCollection<MyRandomType>, and as long as the SDK can infer a schema from MyRandomType, the user never needs to even see a Row object. The to/fromRow functions are what make this work today.
>>
>> One of the points I'd like to make is that this type coercion is a useful concept on its own, separate from schemas. It's especially useful for a type that has a schema and is encoded by RowCoder, since that can represent many more types, but the type coercion doesn't have to be tied to just schemas and RowCoder. We could also do type coercion for types that are effectively wrappers around an integer or a string. It could just be a general way to map language types to base types (i.e. types that we have a coder for). Then it just becomes a general framework for extending coders to represent more language types.
>
> Let's not tie those conversations together. Maybe a similar concept will hold true for general coders (or we might decide to get rid of coders in favor of schemas, in which case that becomes moot), but I don't think we should prematurely generalize.
>
>>>> One of the outstanding questions for that schema representation is how to represent logical types, which may or may not have some language type in each SDK (the canonical example being a timestamp type with seconds and nanos and java.time.Instant). I think this question is critically important, because (c), the SchemaCoder, is actually *defining a logical type* with a language type T in the Java SDK. This becomes clear when you compare SchemaCoder [4] to the Schema.LogicalType interface [5] - both essentially have three attributes: a base type, and two functions for converting to/from that base type. The only difference is that for SchemaCoder the base type must be a Row so it can be represented by a Schema alone, while LogicalType can have any base type that can be represented by FieldType, including a Row.
>>>
>>> This is not true actually. SchemaCoder can have any base type; that's why (in Java) it's SchemaCoder<T>. This is why PCollection<T> can have a schema, even if T is not Row.
>>
>> I'm not sure I effectively communicated what I meant - when I said SchemaCoder's "base type" I wasn't referring to T, I was referring to the base FieldType, whose coder we use for this type. I meant "base type" to be analogous to LogicalType's `getBaseType`, or what Kenn is suggesting we call "representation" in the portable Beam schemas doc. To define some terms from my original message:
>> base type = an instance of FieldType; crucially, this is something that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
>> language type (or "T", "type T", "logical type") = some Java class (or something analogous in the other SDKs) that we may or may not have a coder for. It's possible to define functions for converting instances of the language type to/from the base type.
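As a concrete illustration of these two terms, here is a hedged sketch (in text proto form, with a made-up URN and field names that only loosely follow the draft proto) of a logical type whose base type is INT64 and whose language type is java.time.Instant:

  # Hypothetical instance of the proposed LogicalType message.
  logical_type {
    representation { type_name: INT64 }                 # the base type
    logical_urn: "beam:logical_type:millis_instant:v1"  # made-up URN
  }
  # The to/from functions (in Java, e.g. Instant.toEpochMilli and
  # Instant.ofEpochMilli) would live in the SDK, resolved via the URN
  # or carried in a payload.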
>>
>> I was just trying to make the case that SchemaCoder is really a special case of LogicalType, where `getBaseType` always returns a Row with the stored Schema.
>
> Yeah, I think I got that point.
>
> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
>
>> To make the point with code: SchemaCoder<T> can be made to implement Schema.LogicalType<T, Row> with trivial implementations of getBaseType, toBaseType, and toInputType (I'm not trying to say we should or shouldn't do this, just using it to illustrate my point):
>>
>> class SchemaCoder<T> extends CustomCoder<T> implements Schema.LogicalType<T, Row> {
>>   ...
>>
>>   @Override
>>   public FieldType getBaseType() {
>>     return FieldType.row(getSchema());
>>   }
>>
>>   @Override
>>   public Row toBaseType(T input) {
>>     return this.toRowFunction.apply(input);
>>   }
>>
>>   @Override
>>   public T toInputType(Row base) {
>>     return this.fromRowFunction.apply(base);
>>   }
>>   ...
>> }
>>
>>>> I think it may make sense to fully embrace this duality, by letting SchemaCoder have a baseType other than just Row and renaming it to LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware transforms (a) would operate only on LogicalTypeCoders with a Row base type. Perhaps some of the current schema logic could also be applied more generally to any logical type - for example, to provide type coercion for logical types with a base type other than Row, like int64 and a timestamp class backed by millis, or fixed-size bytes and a UUID class. And having a portable representation for those (non-Row-backed) logical types with some URN would also allow us to pass them to other languages without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>>>
>>> I think the actual overlap here is between the to/from functions in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the equivalent functionality in LogicalType. However, making all of schemas simply just a logical type feels a bit awkward and circular to me. Maybe we should refactor that part out into a LogicalTypeConversion proto, and reference that from both LogicalType and from SchemaCoder?
>>
>> LogicalType is already potentially circular, though. A schema can have a field with a logical type, and that logical type can have a base type of Row with a field with a logical type (and on and on...). To me it seems elegant, not awkward, to recognize that SchemaCoder is just a special case of this concept.
>>
>> Something like the LogicalTypeConversion proto would definitely be an improvement, but I would still prefer just using a top-level logical type :)
>>
>>>> I've added a section to the doc [6] to propose this alternative in the context of the portable representation, but I wanted to bring it up here as well to solicit feedback.
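For reference, here is a hedged sketch, with hypothetical message and field names, of the LogicalTypeConversion refactoring suggested above, referenced from both LogicalType and a portable SchemaCoder:

  // A hedged sketch of the suggested refactoring, not actual protos.
  message LogicalTypeConversion {
    bytes serialized_class = 1;     // the language type T
    bytes to_base_function = 2;     // serialized T -> base function
    bytes from_base_function = 3;   // serialized base -> T function
  }

  message LogicalType {
    FieldType representation = 1;
    string logical_urn = 2;
    LogicalTypeConversion conversion = 3;  // in place of logical_payload
  }

  // A portable SchemaCoder would reuse the same conversion message:
  message SchemaCoderPayload {
    Schema schema = 1;
    LogicalTypeConversion conversion = 2;
  }

Under the top-level-logical-type alternative preferred above, SchemaCoderPayload would disappear and a schema'd PCollection would simply carry a LogicalType whose representation is a Row.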
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>> [6] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>
>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>>> Ah thanks! I added some language there.
>>>>>
>>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>> *To: *dev
>>>>>
>>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>> *To: * <dev@beam.apache.org>
>>>>>>
>>>>>>> We briefly discussed using Arrow schemas in place of Beam schemas entirely in an Arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in Beam schemas. But given that large iterables aren't currently implemented, Beam schemas look very similar to Arrow schemas.
>>>>>>>
>>>>>>> I think it makes sense to take inspiration from Arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations). There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary-encoded, and the schema has an endianness field), but if you drop those concepts the Arrow spec looks pretty similar to the Beam proto spec.
>>>>>>
>>>>>> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>
>>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>> *To: *dev
>>>>>>>
>>>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>> To: dev
>>>>>>>>
>>>>>>>> > Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability.
>>>>>>>> > This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>>>>>>
>>>>>>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>>>>>>
>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>> >>
>>>>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>>>> >>
>>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>> >> To: dev
>>>>>>>> >>
>>>>>>>> >> > FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive, as you aren't forced to serialize into bytes).
>>>>>>>> >> >
>>>>>>>> >> > Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However, when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>>>>>> >> >
>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>> >> >> To: dev
>>>>>>>> >> >>
>>>>>>>> >> >> > This is a huge development. Top posting because I can be more compact.
>>>>>>>> >> >> >
>>>>>>>> >> >> > I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>> >> >>
>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>> >> >>
>>>>>>>> >> >> As per the other discussion, I'm unsure whether the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>> >> >>
>>>>>>>> >> >> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains?
>>>>>>>> >> >> (There's also compactness in representation, encoded and in-memory, though I'm not sure that value is high.)
>>>>>>>> >> >>
>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>> >> >>
>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>> >> >>
>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>> >> >>
>>>>>>>> >> >> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>>>>> >> >>
>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while a URN emphasizes the similarity of built-in and logical types.
>>>>>>>> >> >>
>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
>>>>>>>> >> >>
>>>>>>>> >> >> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation. There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>> >> >>
>>>>>>>> >> >> - Robert
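A hedged sketch of this last suggestion, with made-up names: every type becomes a logical type over a small set of representations, and the current primitive types become well-known URNs whose mapping to their representation is the identity:

  // Hypothetical sketch only, not the actual Beam proto.
  enum Representation {   // formerly "TypeName"
    INT64 = 0;            // other primitives elided
    STRING = 1;
    BYTES = 2;
    ROW = 3;
  }

  message FieldType {
    Representation representation = 1;
    string urn = 2;       // e.g. "beam:type:int64:v1" (made up), with
                          // an empty payload, for the primitive types
    bytes payload = 3;    // arguments for non-trivial logical types
  }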