On Fri, May 15, 2020 at 10:33 PM Reuven Lax <re...@google.com> wrote:
> On Fri, May 15, 2020 at 8:10 PM Kenneth Knowles <k...@apache.org> wrote:

>> On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bhule...@google.com> wrote:

>>> After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion:

>>> > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE"
>>> > - it is OK to convert unknown logical types into their representation type
>>> > - coming out the other side of a SQL statement, a user should expect the type to be a row where every field has a type from the SQL dialect; the input types are gone

>>> I think what you're proposing is basically a preprocessing step on SqlTransform that demotes any unrecognized logical types to their representation. Is that a fair characterization?

>> Exactly

>>> I wonder if we could push the demotion further into the transform to enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." - if some logical type SQL knows nothing about is pulled into the SELECT *, it seems reasonable (from a user perspective) to just pass the type through untouched, and automatically demoting it might be surprising.

>> To me, it seems fancy to pass the type through. And I expect users will soon try to do non-pass-through things involving equality or ordering, where only specialized representations work right. So for the very limited cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to make it happen. If Calcite implicit conversions are a good way to implement this, that is fine by me.

>> Put another way: if the downstream transforms expect to get the logical type back, won't automatic schema conversion already "just work"?

> No. Let's say you are reading protos and writing protos with a SQL filter statement in between. If we lose the logical type, then the output schema may not be able to match against the proto schema. E.g. if the proto has a field of type FIXED32, we reflect this with a Fixed32 logical type with a base type of integer. SQL can operate on the base type just fine because it's just an int32, but if you strip the logical type from the output then it will have trouble matching against the proto schema, as Beam will infer a plain int32 type.

Makes sense. If we implicitly convert only one way then we create an asymmetry. I missed the fact that we wouldn't coerce the representation type back to the logical type.

Kenn

>> (I don't have any response to the avro/proto points except I think I agree with both of you)

>> Kenn

>>> What if we had the ability to implicitly convert logical types to their representation? Then the above example would work, because no implicit conversion is necessary, and also relations that introspect the logical type (e.g. projecting f_location.lat) could work by including an implicit conversion. The implicit conversion would still be a one-way operation, but it would only happen when necessary.

>>> I really have no idea how complicated this would be to implement. Would it be feasible? Is it ill-advised?

>>> Brian

>>> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:

>>>> I would not describe the base type as the "wire type." If that were true, then the only base type we should support would be byte array.
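To make the "preprocessing step on SqlTransform that demotes unrecognized logical types" idea above concrete, here is a minimal sketch of what such a demotion pass over a schema might look like. The helper name is hypothetical, and nested rows, arrays, and maps are ignored for brevity:

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.Field;
    import org.apache.beam.sdk.schemas.Schema.FieldType;

    /** Hypothetical pre-processing helper: demote logical-type fields to their representation. */
    class DemoteLogicalTypes {
      static Schema demote(Schema schema) {
        Schema.Builder builder = Schema.builder();
        for (Field field : schema.getFields()) {
          FieldType type = field.getType();
          // Keep only the base (representation) type for logical-type fields, e.g. a proto
          // Fixed32 logical type becomes a plain INT32; everything else passes through.
          FieldType demoted =
              type.getLogicalType() != null ? type.getLogicalType().getBaseType() : type;
          builder.addField(field.getName(), demoted);
        }
        return builder.build();
      }
    }

This is exactly the step that loses information on the way out: the demoted output schema can no longer be matched back against the original proto schema in the FIXED32 example above.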
>>>> My simple point is that this is no different than normal schema fields. You will find many normal schemas containing data encoded into other field types. You will also find this in SQL databases. If naming these fields with a logical type causes us to throw up our hands and declare that they can't be introspected, then our users will likely just avoid logical types and encode this data into their own LONG and STRING fields. It will also mean that SQL will not work well over Avro or Proto input.

>>> Is the Avro/Proto connection really a problem? I would think that we could map most common Avro or Proto types to either primitive types or "standard" logical types which are the same as the ones used by SQL wherever possible, e.g. the NanosInstant and NanosDuration types added for proto.

>>>> Reuven

>>>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <apill...@google.com> wrote:

>>>>> My understanding is that the base type is effectively the wire format of the type, while the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the LocalDate type has a struct as the in-memory representation. There is a performance penalty to this conversion, and I think there are cases where users (SQL for example) would want a fast path that skips the serialization/deserialization steps if possible.

>>>>> As for more complex logical types, I would urge caution in making assumptions around the base types. I expect there will be base types of strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL has a civil time module that packs a struct into a Long. All of these are things you can write SQL around, but I think the users might be somewhat surprised when they come out less than usable by default.

>>>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:

>>>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:

>>>>>>> When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around a couple of points: 1) at the very least DATETIME is a poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a logical type is better suited for this purpose.

>>>>>>> Since then, it's been my intention to replace Java's DATETIME with a MillisInstant logical type backed by Long milliseconds since the epoch (see BEAM-7554 [3]). I've finally managed to get a PR together that does so (and keeps tests passing): https://github.com/apache/beam/pull/11456

>>>>>>> There could be a couple of concerns with this PR that I think would be better discussed here rather than on github.

>>>>>>> ## Breaking changes in low-level APIs

>>>>>>> The PR includes some breaking changes in public, low-level schema APIs: it removes DATETIME from the TypeName enum [4], and it will also change the type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long milliseconds since epoch rather than org.joda.time.Instant).
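For illustration, here is a minimal sketch of the user-visible difference for code that reaches for the base value directly (the field name "ts" is hypothetical):

    import org.apache.beam.sdk.values.Row;
    import org.joda.time.Instant;

    class TimestampFieldAccess {
      // Reads a timestamp field that was a DATETIME primitive and becomes a
      // MillisInstant logical type after the PR.
      static Instant readTimestamp(Row row) {
        // Before: getBaseValue returned an org.joda.time.Instant directly.
        // After: the base value is the representation type, a Long holding millis
        // since the epoch, so callers that want a joda Instant have to rebuild it.
        Long millis = (Long) row.getBaseValue("ts");
        return new Instant(millis);
      }
    }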
>>>>>>> Both of these changes have the potential to break user code. That being said, I think the risk is justified for a few reasons:
>>>>>>> - These are lower-level APIs that we don't intend for most users to use. The preferred higher-level APIs (SQL, Schema transforms, and inferred schemas for user types) should seamlessly transition over to the new MillisInstant logical type.
>>>>>>> - Schemas are an experimental feature.
>>>>>>> - We can clearly document this breaking change in the release that includes it.

>>>>>> I am a bit worried about this. Schemas have been marked experimental for almost two years, and I've found that many Beam users are using them. If we make a breaking change, can we make sure that it is extremely clear to users how to fix their code? I'm afraid we are now in a situation where the code is theoretically experimental but practically is widely used.

>>>>>>> ## Mixing joda and java 8 time

>>>>>>> The NanosInstant logical type that Alex added uses java.time.Instant as its input type, while my MillisInstant type uses org.joda.time.Instant for compatibility with the rest of Beam and the previous DATETIME primitive type. It feels weird, but it also makes a certain sort of sense to use joda time (which has millisecond precision) for MillisInstant, and java 8 time (which has nanos) for NanosInstant. Also, the choice shouldn't have a significant effect on end users - the schema inference code could generate conversions between java 8 time and joda time (as we already do for converting between various joda time types [5]) so user types can use either one.

>>>>>> +1

>>>>>>> ## Arbitrary Logical Types and SQL

>>>>>>> Previously, much of the SQL code was written to operate on the _base type_ value for any logical types. So for the new MillisInstant type, SQL would attempt to operate on the underlying Long rather than on an org.joda.time.Instant instance. Thus when I switched over to a MillisInstant logical type as the default for SQL date and time types, any tests that used them began failing with ClassCastExceptions.

>>>>>>> My solution was just to update the SQL code to only ever reference the input type (i.e. org.joda.time.Instant) for logical types. A good example of this type of change is in BeamAggregationRel [6].

>>>>>>> My change does pass all of the SQL tests (including internal Google tests), but I'm a little concerned that using the base type throughout SQL was a conscious decision that I'm stomping on. In theory, it could ensure that SqlTransform would be able to operate on user types that contain custom user logical types, without requiring SqlTransform to understand those types. This is just speculation though; there aren't actually any tests verifying that functionality. Also, I don't think any such tests would have worked, since there are several places where we explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
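On the joda / java 8 interoperability point above, the conversions that schema inference could generate amount to a millisecond-epoch round trip. A sketch (class and method names are hypothetical; note that the java 8 to joda direction truncates to millisecond precision):

    import java.time.Instant;

    class TimeInterop {
      // java.time -> joda: truncates nanosecond precision to milliseconds.
      static org.joda.time.Instant toJoda(Instant javaInstant) {
        return new org.joda.time.Instant(javaInstant.toEpochMilli());
      }

      // joda -> java.time: lossless, since joda only carries milliseconds.
      static Instant toJavaTime(org.joda.time.Instant jodaInstant) {
        return Instant.ofEpochMilli(jodaInstant.getMillis());
      }
    }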
>>>>>>> Personally I'm ok with this move, because I'm against the idea of implicitly stripping away logical types like this in schema-aware transforms. A logical type indicates that the value has some additional meaning beyond just the base type (and maybe a restricted range), and I don't think transforms should be able to ignore that meaning unless the user explicitly allows it, or first converts to the base type themselves.

>>>>>> I disagree with this personally:

>>>>>> Arguably, _every_ schema has "additional meaning" beyond the fields, and a LogicalType is just a nice way of naming that and allowing the use of other classes to manage that type. For example, let's say you have a record containing latitude and longitude fields - today you can easily write SQL over these fields. Later you decide that this is a very common pattern in your data and decide to create a Location logical type; this also lets you represent this with a convenience Location class that has many other helpful features (distance functions, builders, etc.). It seems unfortunate if you suddenly lose the ability to easily run SQL over this type. Introducing a logical type hasn't changed the semantic meaning of your data at all. All it has done is let you name it, and it would be unfortunate if analyzing your data stream suddenly became harder as a result.

>>> I would argue that defining the logical type *has* changed the semantic meaning. Before you just had a tuple of floating point numbers that happened to have fields named lat and long. After defining a logical type you have a location, and lat and long have a restricted range.

>>>>>> Another reason I don't like this is that the pattern we have encouraged (and already use extensively) is to use logical types when converting from external formats such as avro, proto, kryo, etc. If logical types aren't supported by SQL, then we are effectively saying that data coming from these formats can't be easily analyzed with SQL (or SQL will have to special-case the logical types used by every single such input format, which does not seem sustainable).

>>>>>> I think the problem you ran into is that today an aggregation deals with logical types in an all-or-nothing manner. You either aggregate everything by the representation type or everything by the logical type. If we instead allowed SQL to specify well-known logical types (e.g. Instant), I think that this problem would go away.
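As a concrete reference point for the latitude/longitude discussion above, a Location logical type over a ROW base type might look roughly like this. The class, identifier, and field names are illustrative, and the exact Schema.LogicalType method set is assumed from the Java SDK of the time:

    import java.io.Serializable;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.values.Row;

    /** Hypothetical Location logical type over a ROW{lat: DOUBLE, lng: DOUBLE} base type. */
    class LocationLogicalType implements Schema.LogicalType<LocationLogicalType.Location, Row> {

      /** The convenience class users would work with directly. */
      static class Location implements Serializable {
        final double lat;
        final double lng;
        Location(double lat, double lng) {
          this.lat = lat;
          this.lng = lng;
        }
      }

      private static final Schema BASE_SCHEMA =
          Schema.builder().addDoubleField("lat").addDoubleField("lng").build();

      @Override
      public String getIdentifier() {
        return "example:location"; // identifier is illustrative
      }

      // No type argument for this logical type.
      public FieldType getArgumentType() {
        return null;
      }

      @Override
      public FieldType getBaseType() {
        return FieldType.row(BASE_SCHEMA);
      }

      @Override
      public Row toBaseType(Location input) {
        return Row.withSchema(BASE_SCHEMA).addValues(input.lat, input.lng).build();
      }

      @Override
      public Location toInputType(Row base) {
        return new Location(base.getDouble("lat"), base.getDouble("lng"));
      }
    }

Whether SQL then operates on the base Row (lat/lng fields) or on the Location input type is exactly the question being debated in this thread.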
>>>>>>> If anyone has any concerns with these points, could you raise them here? Otherwise I'd like to move ahead with the PR.

>>>>>>> (Sorry this message turned out to be a bit of a novel. LMK if it would be better served by separate threads, or maybe a doc.)

>>>>>>> Thanks!
>>>>>>> Brian

>>>>>>> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>>>>> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>>>>> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>>>>> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>>>>> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>>>>> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409