Ah I see, I didn't realize that. Then I suppose we'll need the to/from functions somewhere in the logical type conversion to preserve the current behavior.
I'm still a little hesitant to make these functions an explicit part of LogicalTypeConversion for another reason. Down the road, schemas could give us an avenue to use a batched columnar format (presumably Arrow, but of course others are possible). By making to/from an explicit part of logical types we add some element-wise logic to a schema representation that's otherwise agnostic to element-wise vs. batched encodings. I suppose you could make an argument that to/from are only for custom types. There will also be some set of well-known types identified only by URN and some parameters, which could easily be translated to a columnar format. We could just not support custom types fully if we add a columnar encoding, or maybe add optional toBatch/fromBatch functions when/if we get there.

What about something like this that makes the two different kinds of logical types explicit?

// Describes a logical type and how to convert between it and its
// representation (e.g. Row).
message LogicalTypeConversion {
  oneof conversion {
    Standard standard = 1;
    Custom custom = 2;
  }

  message Standard {
    string urn = 1;
    repeated string args = 2; // could also be a map
  }

  message Custom {
    FunctionSpec(?) toRepresentation = 1;
    FunctionSpec(?) fromRepresentation = 2;
    bytes type = 3; // e.g. serialized class for Java
  }
}

And LogicalType and Schema become:

message LogicalType {
  FieldType representation = 1;
  LogicalTypeConversion conversion = 2;
}

message Schema {
  ...
  repeated Field fields = 1;
  LogicalTypeConversion conversion = 2; // implied that representation is Row
}

Brian

On Sat, Jun 1, 2019 at 10:44 AM Reuven Lax <re...@google.com> wrote:

> Keep in mind that right now the SchemaRegistry is only assumed to exist at graph-construction time, not at execution time; all information in the schema registry is embedded in the SchemaCoder, which is the only thing we keep around when the pipeline is actually running. We could look into changing this, but it would potentially be a very big change, and I do think we should start getting users actively using schemas soon.
>
> On Fri, May 31, 2019 at 3:40 PM Brian Hulette <bhule...@google.com> wrote:
>
>> > Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
>>
>> Maybe. Right now the proposed LogicalType message is pretty simple/generic:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> If we keep just logical_urn and logical_payload, the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions. Or, alternatively, we could have a generalization of the SchemaRegistry for logical types. Implementations for standard types and user-defined types would be registered by URN, and the SDK could look them up given just a URN. I put a brief section about this alternative in the doc last week [1]. What I suggested there included removing the logical_payload field, which is probably overkill. The critical piece is just relying on a registry in the SDK to look up types and to/from functions rather than storing them in the portable schema itself.
>>
>> I kind of like keeping the LogicalType message generic for now, since it gives us a way to try out these various approaches, but maybe that's just a cop out.
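>>
>> To sketch the registry idea from above concretely (everything here is hypothetical — the registry and its method don't exist today, and the URN is made up — it's just to illustrate the lookup):
>>
>> // Hypothetical SDK-side registry: the portable schema would carry only
>> // a URN, and each SDK would look up the language type and its to/from
>> // functions locally, the way SchemaRegistry works at construction time.
>> Schema.LogicalType<Instant, Long> millisInstant =
>>     logicalTypeRegistry.getByUrn("beam:logical_type:millis_instant:v1");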
>>
>> [1] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy
>>
>> On Fri, May 31, 2019 at 12:36 PM Reuven Lax <re...@google.com> wrote:
>>
>>> On Tue, May 28, 2019 at 10:11 AM Brian Hulette <bhule...@google.com> wrote:
>>>
>>>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>
>>>>>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row and we should think about that.
>>>>>>
>>>>>> I'm a little concerned that the current proposals for a portable representation don't actually fully represent Schemas. It seems to me that the current Java-only Schemas are made up of three concepts that are intertwined:
>>>>>> (a) The Java SDK specific code for schema inference, type coercion, and "schema-aware" transforms.
>>>>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
>>>>>> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and functions for converting Rows with that schema to/from a Java type T. Those functions and the RowCoder are then composed to provide a Coder for the type T.
>>>>>
>>>>> RowCoder is currently just an internal implementation detail, it can be eliminated. SchemaCoder is the only thing that determines a schema today.
>>>>
>>>> Why not keep it around? I think it would make sense to have a RowCoder implementation in every SDK, as well as something like SchemaCoder that defines a conversion from that SDK's "Row" to the language type.
>>>
>>> The point is that from a programmer's perspective, there is nothing much special about Row. Any type can have a schema, and the only special thing about Row is that it's always guaranteed to exist. From that standpoint, Row is nearly an implementation detail. Today RowCoder is never set on _any_ PCollection, it's literally just used as a helper library, so there's no real need for it to exist as a "Coder."
>>>
>>>>>> We're not concerned with (a) at this time since that's specific to the SDK, not the interface between them. My understanding is we just want to define a portable representation for (b) and/or (c).
>>>>>>
>>>>>> What has been discussed so far is really just a portable representation for (b), the RowCoder, since the discussion is only around how to represent the schema itself and not the to/from functions.
>>>>>
>>>>> Correct. The to/from functions are actually related to (a). One of the big goals of schemas was that users should not be forced to operate on rows to get schemas. A user can create PCollection<MyRandomType> and as long as the SDK can infer a schema from MyRandomType, the user never needs to even see a Row object. The to/fromRow functions are what make this work today.
>>>>
>>>> One of the points I'd like to make is that this type coercion is a useful concept on its own, separate from schemas. It's especially useful for a type that has a schema and is encoded by RowCoder, since that can represent many more types, but the type coercion doesn't have to be tied to just schemas and RowCoder. We could also do type coercion for types that are effectively wrappers around an integer or a string. It could just be a general way to map language types to base types (i.e. types that we have a coder for). Then it just becomes a general framework for extending coders to represent more language types.
>>>
>>> Let's not tie those conversations. Maybe a similar concept will hold true for general coders (or we might decide to get rid of coders in favor of schemas, in which case that becomes moot), but I don't think we should prematurely generalize.
>>>
>>>>>> One of the outstanding questions for that schema representation is how to represent logical types, which may or may not have some language type in each SDK (the canonical example being a timestamp type with seconds and nanos and java.time.Instant). I think this question is critically important, because (c), the SchemaCoder, is actually *defining a logical type* with a language type T in the Java SDK. This becomes clear when you compare SchemaCoder[4] to the Schema.LogicalType interface[5] - both essentially have three attributes: a base type, and two functions for converting to/from that base type. The only difference is that for SchemaCoder the base type must be a Row so it can be represented by a Schema alone, while LogicalType can have any base type that can be represented by FieldType, including a Row.
>>>>>
>>>>> This is not true actually. SchemaCoder can have any base type, that's why (in Java) it's SchemaCoder<T>. This is why PCollection<T> can have a schema, even if T is not Row.
>>>>
>>>> I'm not sure I effectively communicated what I meant - when I said SchemaCoder's "base type" I wasn't referring to T, I was referring to the base FieldType, whose coder we use for this type. I meant "base type" to be analogous to LogicalType's `getBaseType`, or what Kenn is suggesting we call "representation" in the portable Beam schemas doc. To define some terms from my original message:
>>>> base type = an instance of FieldType, crucially this is something that we have a coder for (be it VarIntCoder, Utf8Coder, RowCoder, ...)
>>>> language type (or "T", "type T", "logical type") = some Java class (or something analogous in the other SDKs) that we may or may not have a coder for. It's possible to define functions for converting instances of the language type to/from the base type.
>>>>
>>>> I was just trying to make the case that SchemaCoder is really a special case of LogicalType, where `getBaseType` always returns a Row with the stored Schema.
>>>
>>> Yeah, I think I got that point.
>>>
>>> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in?
>>>
>>>> To make the point with code: SchemaCoder<T> can be made to implement Schema.LogicalType<T, Row> with trivial implementations of getBaseType, toBaseType, and toInputType (I'm not trying to say we should or shouldn't do this, just using it to illustrate my point):
>>>>
>>>> class SchemaCoder<T> extends CustomCoder<T> implements Schema.LogicalType<T, Row> {
>>>>   ...
>>>>
>>>>   @Override
>>>>   public FieldType getBaseType() {
>>>>     return FieldType.row(getSchema());
>>>>   }
>>>>
>>>>   @Override
>>>>   public Row toBaseType(T input) {
>>>>     return this.toRowFunction.apply(input);
>>>>   }
>>>>
>>>>   @Override
>>>>   public T toInputType(Row base) {
>>>>     return this.fromRowFunction.apply(base);
>>>>   }
>>>>   ...
>>>> }
>>>>>>
>>>>>> I think it may make sense to fully embrace this duality, by letting SchemaCoder have a baseType other than just Row and renaming it to LogicalTypeCoder/LanguageTypeCoder. The current Java SDK schema-aware transforms (a) would operate only on LogicalTypeCoders with a Row base type. Perhaps some of the current schema logic could also be applied more generally to any logical type - for example, to provide type coercion for logical types with a base type other than Row, like int64 and a timestamp class backed by millis, or fixed-size bytes and a UUID class (a quick sketch of that case follows at the end of this message). And having a portable representation that represents those (non-Row-backed) logical types with some URN would also allow us to pass them to other languages without unnecessarily wrapping them in a Row in order to use SchemaCoder.
>>>>>
>>>>> I think the actual overlap here is between the to/from functions in SchemaCoder (which is what allows SchemaCoder<T> where T != Row) and the equivalent functionality in LogicalType. However making all of schemas simply just a logical type feels a bit awkward and circular to me. Maybe we should refactor that part out into a LogicalTypeConversion proto, and reference that from both LogicalType and from SchemaCoder?
>>>>
>>>> LogicalType is already potentially circular though. A schema can have a field with a logical type, and that logical type can have a base type of Row with a field with a logical type (and on and on...). To me it seems elegant, not awkward, to recognize that SchemaCoder is just a special case of this concept.
>>>>
>>>> Something like the LogicalTypeConversion proto would definitely be an improvement, but I would still prefer just using a top-level logical type :)
>>>>>>
>>>>>> I've added a section to the doc [6] to propose this alternative in the context of the portable representation, but I wanted to bring it up here as well to solicit feedback.
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoder.java#L41
>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L59
>>>>>> [3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L48
>>>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L33
>>>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L489
>>>>>> [6] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.7570feur1qin
>>>>>>
>>>>>> On Fri, May 10, 2019 at 9:16 AM Brian Hulette <bhule...@google.com> wrote:
>>>>>>
>>>>>>> Ah thanks! I added some language there.
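>>>>>>
>>>>>> The promised sketch of the non-Row-backed UUID case, written against the Java LogicalType interface as used above (a hedged sketch only, not existing API: a fixed-size bytes representation is assumed where today only FieldType.BYTES exists, and imports are omitted):
>>>>>>
>>>>>> class UuidLogicalType implements Schema.LogicalType<UUID, byte[]> {
>>>>>>   @Override
>>>>>>   public FieldType getBaseType() {
>>>>>>     return FieldType.BYTES; // ideally a fixed-size (16-byte) variant, if we add one
>>>>>>   }
>>>>>>
>>>>>>   @Override
>>>>>>   public byte[] toBaseType(UUID input) {
>>>>>>     // Encode the UUID as its two longs in big-endian order.
>>>>>>     ByteBuffer bb = ByteBuffer.allocate(16);
>>>>>>     bb.putLong(input.getMostSignificantBits());
>>>>>>     bb.putLong(input.getLeastSignificantBits());
>>>>>>     return bb.array();
>>>>>>   }
>>>>>>
>>>>>>   @Override
>>>>>>   public UUID toInputType(byte[] base) {
>>>>>>     ByteBuffer bb = ByteBuffer.wrap(base);
>>>>>>     return new UUID(bb.getLong(), bb.getLong());
>>>>>>   }
>>>>>> }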
>>>>>>>
>>>>>>> *From: *Kenneth Knowles <k...@apache.org>
>>>>>>> *Date: *Thu, May 9, 2019 at 5:31 PM
>>>>>>> *To: *dev
>>>>>>>
>>>>>>>> *From: *Brian Hulette <bhule...@google.com>
>>>>>>>> *Date: *Thu, May 9, 2019 at 2:02 PM
>>>>>>>> *To: * <dev@beam.apache.org>
>>>>>>>>
>>>>>>>>> We briefly discussed using Arrow schemas in place of Beam schemas entirely in an Arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in Beam schemas. But given that large iterables aren't currently implemented, Beam schemas look very similar to Arrow schemas.
>>>>>>>>>
>>>>>>>>> I think it makes sense to take inspiration from Arrow schemas where possible, and maybe even copy them outright. Arrow already has a portable (flatbuffers) schema representation [2], and implementations for it in many languages that we may be able to re-use as we bring schemas to more SDKs (the project has Python and Go implementations). There are a couple of concepts in Arrow schemas that are specific to the format and wouldn't make sense for us (fields can indicate whether or not they are dictionary encoded, and the schema has an endianness field), but if you drop those concepts the Arrow spec looks pretty similar to the Beam proto spec.
>>>>>>>>
>>>>>>>> FWIW I left a blank section in the doc for filling out what the differences are and why, and conversely what the interop opportunities may be. Such sections are some of my favorite sections of design docs.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>> [1] https://lists.apache.org/thread.html/6be7715e13b71c2d161e4378c5ca1c76ac40cfc5988a03ba87f1c434@%3Cdev.beam.apache.org%3E
>>>>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>>>>>>>>>
>>>>>>>>> *From: *Robert Bradshaw <rober...@google.com>
>>>>>>>>> *Date: *Thu, May 9, 2019 at 1:38 PM
>>>>>>>>> *To: *dev
>>>>>>>>>
>>>>>>>>>> From: Reuven Lax <re...@google.com>
>>>>>>>>>> Date: Thu, May 9, 2019 at 7:29 PM
>>>>>>>>>> To: dev
>>>>>>>>>>
>>>>>>>>>> > Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like Arrow to encode batches of rows across portability. This doesn't affect data semantics of course, but having a richer, more-expressive type system opens up other opportunities.
>>>>>>>>>>
>>>>>>>>>> But we could do all of that with a RowCoder we understood to designate the type(s), right?
>>>>>>>>>>
>>>>>>>>>> > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason about and operate on such types is of course much richer than anything Coders offer right now.)
>>>>>>>>>> >>
>>>>>>>>>> >> From: Reuven Lax <re...@google.com>
>>>>>>>>>> >> Date: Thu, May 9, 2019 at 4:52 PM
>>>>>>>>>> >> To: dev
>>>>>>>>>> >>
>>>>>>>>>> >> > FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive as you aren't forced to serialize into bytes).
>>>>>>>>>> >> >
>>>>>>>>>> >> > Due to compatibility constraints and the effort that would be involved in such a change, I think the practical decision should be for schemas and coders to coexist for the time being. However when we start planning Beam 3.0, deprecating coders is something I would like to suggest.
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Thu, May 9, 2019 at 7:48 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> From: Kenneth Knowles <k...@apache.org>
>>>>>>>>>> >> >> Date: Thu, May 9, 2019 at 10:05 AM
>>>>>>>>>> >> >> To: dev
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > This is a huge development. Top posting because I can be more compact.
>>>>>>>>>> >> >> >
>>>>>>>>>> >> >> > I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a quick draft at https://s.apache.org/beam-schemas and added everyone on this thread as editors. I am still in the process of writing this to match the thread.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Thanks! Added some comments there.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > *Multiple timestamp resolutions*: you can use logical types to represent nanos the same way Java and proto do.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> As per the other discussion, I'm unsure whether the value in supporting multiple timestamp resolutions is high enough to outweigh the cost.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > *Why multiple int types?* The domains of values for these types are different. For a language with one "int" or "number" type, that's another domain of values.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> What is the value in having different domains? If your data has a natural domain, chances are it doesn't line up exactly with one of these. I guess it's for languages whose types have specific domains? (There's also compactness in representation, encoded and in-memory, though I'm not sure that's high.)
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > *Columnar/Arrow*: making sure we unlock the ability to take this path is paramount. So tying it directly to a row-oriented coder seems counterproductive.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> I don't think Coders are necessarily row-oriented. They are, however, bytes-oriented. (Perhaps they need not be.) There seems to be a lot of overlap between what Coders express in terms of element typing information and what Schemas express, and I'd rather have one concept if possible. Or have a clear division of responsibilities.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > *Multimap*: what does it add over an array-valued map or large-iterable-valued map? (honest question, not rhetorical)
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Multimap has a different notion of what it means to contain a value, can handle (unordered) unions of non-disjoint keys, etc. Maybe this isn't worth a new primitive type.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> > *URN/enum for type names*: I see the case for both. The core types are fundamental enough that they should never really change - after all, proto, thrift, avro, and arrow have addressed this (not to mention most programming languages). Maybe additions once every few years. I prefer the smallest intersection of these schema languages. A oneof is more clear, while URN emphasizes the similarity of built-in and logical types.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Hmm... Do we have any examples of the multi-level primitive/logical type in any of these other systems? I have a bias towards all types being on the same footing unless there is a compelling reason to divide things into primitive/user-defined ones.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Here it seems like the most essential value of the primitive type set is to describe the underlying representation, for encoding elements in a variety of ways (notably columnar, but also interfacing with other external systems like IOs). Perhaps, rather than the previous suggestion of making everything a logical type of bytes, this could be made clear by still making everything a logical type, but renaming "TypeName" to Representation. There would be URNs (typically with empty payloads) for the various primitive types (whose mapping to their representations would be the identity).
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> - Robert