On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bhule...@google.com> wrote:
> After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion:
>
> > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE"
> > - it is OK to convert unknown logical types into their representation type
> > - coming out the other side of a SQL statement, a user should expect the type to be a row where every field has a type from the SQL dialect; the input types are gone
>
> I think what you're proposing is basically a preprocessing step on SqlTransform that demotes any unrecognized logical types to their representation. Is that a fair characterization?

Exactly.

> I wonder if we could push the demotion further into the transform to enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." - if some logical type SQL knows nothing about is pulled into the SELECT *, it seems reasonable (from a user perspective) to just pass the type through untouched, and automatically demoting it might be surprising.

To me, it seems fancy to pass the type through. And I expect users will soon try to do non-pass-through things involving equality or ordering, where only specialized representations work right. So for the very limited cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to make it happen. If Calcite implicit conversions are a good way to implement this, that is fine by me.

Put another way: if the downstream transforms expect to get the logical type back, won't automatic schema conversion already "just work"?

(I don't have any response to the avro/proto points except I think I agree with both of you.)

Kenn

> What if we had the ability to implicitly convert logical types to their representation? Then the above example would work, because no implicit conversion is necessary, and also relations that introspect the logical type (e.g. projecting f_location.lat) could work by including an implicit conversion. The implicit conversion would still be a one-way operation, but it would only happen when necessary.
>
> I really have no idea how complicated this would be to implement. Would it be feasible? Is it ill-advised?
>
> Brian
>
> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:
>
>> I would not describe the base type as the "wire type." If that were true, then the only base type we should support would be byte array.
>>
>> My simple point is that this is no different than normal schema fields. You will find many normal schemas containing data encoded into other field types. You will also find this in SQL databases. If naming these fields with a logical type causes us to throw up our hands and declare that they can't be introspectable, then our users will likely just avoid logical types and encode this data into their own LONG and STRING fields. It will also mean that SQL will not work well over Avro or Proto input.
>
> Is the Avro/Proto connection really a problem? I would think that we could map most common Avro or Proto types to either primitive types or "standard" logical types which are the same as the ones used by SQL wherever possible, e.g. the NanosInstant and NanosDuration types added for proto.
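For concreteness, the demotion pass discussed above might look something like this sketch against the Java SDK's Schema API. It is illustrative only: isKnownToSql and the identifier it checks are made up, the pass ignores nested ROW/ARRAY/MAP fields, and the Row values themselves would still need the matching toBaseType conversion.

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.schemas.Schema.TypeName;

    public class LogicalTypeDemotion {
      /** Replace unrecognized logical types in a schema with their representation type. */
      public static Schema demoteUnknownLogicalTypes(Schema input) {
        Schema.Builder out = Schema.builder();
        for (Schema.Field field : input.getFields()) {
          FieldType type = field.getType();
          if (type.getTypeName() == TypeName.LOGICAL_TYPE
              && !isKnownToSql(type.getLogicalType().getIdentifier())) {
            // Demote: keep the field name, take the base (representation) type.
            // (A real pass would also recurse into ROW, ARRAY, and MAP fields.)
            out.addField(field.getName(), type.getLogicalType().getBaseType());
          } else {
            out.addField(field.getName(), type);
          }
        }
        return out.build();
      }

      // Hypothetical predicate: whichever identifiers the SQL dialect maps to
      // its own types. The identifier string is illustrative.
      private static boolean isKnownToSql(String identifier) {
        return "beam:logical_type:millis_instant:v1".equals(identifier);
      }
    }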
>> Reuven
>>
>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <apill...@google.com> wrote:
>>
>>> My understanding is that the base type is effectively the wire format of the type, while the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the LocalDate type has a struct as the in-memory representation. There is a performance penalty to this conversion, and I think there are cases where users (SQL for example) would want a fast path that skips the serialization/deserialization steps if possible.
>>>
>>> As for more complex logical types, I would urge caution in making assumptions around the base types. I expect there will be base types of strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL has a civil time module that packs a struct into a Long. All of these are things you can write SQL around, but I think the users might be somewhat surprised when they come out less than usable by default.
>>>
>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>>> When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around a couple of points: 1) at the very least DATETIME is a poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a logical type is better suited for this purpose.
>>>>>
>>>>> Since then, it's been my intention to replace Java's DATETIME with a MillisInstant logical type backed by Long milliseconds since the epoch (see BEAM-7554 [3]). I've finally managed to get a PR together that does so (and keeps tests passing): https://github.com/apache/beam/pull/11456
>>>>>
>>>>> There could be a couple of concerns with this PR that I think would be better discussed here rather than on GitHub.
>>>>>
>>>>> ## Breaking changes in low-level APIs
>>>>> The PR includes some breaking changes in public, low-level schema APIs: it removes DATETIME from the TypeName enum [4], and it will also change the type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long milliseconds since epoch rather than org.joda.time.Instant). Both of these changes have the potential to break user code. That being said, I think the risk is justified for a few reasons:
>>>>> - These are lower-level APIs that we don't intend for most users to use. The preferred higher-level APIs (SQL, schema transforms, and inferred schemas for user types) should seamlessly transition over to the new MillisInstant logical type.
>>>>> - Schemas are an experimental feature.
>>>>> - We can clearly document this breaking change in the release that includes it.
>>>>
>>>> I am a bit worried about this. Schemas have been marked experimental for almost two years, and I've found that many Beam users are using them. If we make a breaking change, can we make sure that it is extremely clear to users how to fix their code? I'm afraid we are now in the situation where the code is theoretically experimental but practically is widely used.
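To illustrate that concern: the Row#getBaseValue change described above would surface in user code roughly as follows. A sketch, assuming a Row with a timestamp field named "ts"; the accessors are the real Row API, but the fix shown is just one plausible migration.

    import org.apache.beam.sdk.values.Row;

    class MillisInstantMigration {
      static void readTimestamp(Row row) {
        // Before the PR (DATETIME primitive): the base value was the Instant itself.
        // org.joda.time.Instant ts = (org.joda.time.Instant) row.getBaseValue("ts");

        // After the PR (MillisInstant logical type): the base value is Long millis
        // since the epoch, so the cast above throws ClassCastException. One fix:
        long millis = (Long) row.getBaseValue("ts");
        org.joda.time.Instant ts = new org.joda.time.Instant(millis);

        // The higher-level accessor still returns the logical (input) type.
        org.joda.time.Instant viaValue = row.getValue("ts");
      }
    }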
>>>>> ## Mixing joda and java 8 time
>>>>> The NanosInstant logical type that Alex added uses java.time.Instant as its input type, while my MillisInstant type uses org.joda.time.Instant for compatibility with the rest of Beam and the previous DATETIME primitive type. It feels weird, but it also makes a certain sort of sense to use joda time (which has millisecond precision) for MillisInstant, and java 8 time (which has nanos) for NanosInstant. Also, the choice shouldn't have a significant effect on end users - the schema inference code could generate conversions between java 8 time and joda time (as we already do for converting between various joda time types [5]) so user types can use either one.
>>>>
>>>> +1
>>>>
>>>>> ## Arbitrary Logical Types and SQL
>>>>> Previously much of the SQL code was written to operate on the _base type_ value for any logical type. So for the new MillisInstant type, SQL would attempt to operate on the underlying Long, rather than on an org.joda.time.Instant instance. Thus when I switched over to a MillisInstant logical type as the default for SQL date and time types, any tests that used them began failing with ClassCastExceptions.
>>>>>
>>>>> My solution was just to update the SQL code to only ever reference the input type (i.e. org.joda.time.Instant) for logical types. A good example of this type of change is in BeamAggregationRel [6].
>>>>>
>>>>> My change does pass all of the SQL tests (including internal Google tests), but I'm a little concerned that using the base type throughout SQL was a conscious decision that I'm stomping on. In theory, it could ensure that SqlTransform would be able to operate on user types that contain custom user logical types, without requiring SqlTransform to understand those types. This is just speculation though; there aren't actually any tests verifying that functionality. Also, I don't think any such tests would have worked, since there are several places where we explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
>>>>>
>>>>> Personally I'm ok with this move, because I'm against the idea of implicitly stripping away logical types like this in schema-aware transforms. A logical type indicates that the value has some additional meaning beyond just the base type (and maybe a restricted range), and I don't think transforms should be able to ignore that meaning unless the user explicitly allows it, or first converts to the base type themselves.
>>>>
>>>> I disagree with this personally. Arguably, _every_ schema has "additional meaning" beyond the fields, and a LogicalType is just a nice way of naming that and allowing the use of other classes to manage that type. For example, let's say you have a record containing latitude and longitude fields - today you can easily write SQL over these fields. Later you decide that this is a very common pattern in your data and decide to create a Location logical type; this also lets you represent this with a convenience Location class that has many other helpful features (distance functions, builders, etc.). It seems unfortunate if you suddenly lose the ability to easily run SQL over this type. Introducing a logical type hasn't changed the semantic meaning of your data at all. All it has done is let you name it, and it would be unfortunate if analyzing your data stream suddenly became harder as a result.
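That example, sketched against Java's Schema.LogicalType interface (the Location class, identifier, and field names are invented here, and the interface's exact method set has varied across Beam versions). The point is that the base type is the same two-double row the data always was:

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.values.Row;

    // Plain user class with whatever convenience methods you like
    // (distance functions, builders, etc.).
    class Location {
      final double lat;
      final double lng;
      Location(double lat, double lng) { this.lat = lat; this.lng = lng; }
    }

    class LocationLogicalType implements Schema.LogicalType<Location, Row> {
      // The base type is exactly the (lat, long) row the user had before.
      static final Schema BASE_SCHEMA =
          Schema.builder().addDoubleField("lat").addDoubleField("long").build();

      @Override public String getIdentifier() { return "example:location:v1"; }
      @Override public FieldType getArgumentType() { return null; } // no argument
      @Override public FieldType getBaseType() { return FieldType.row(BASE_SCHEMA); }

      @Override public Row toBaseType(Location input) {
        return Row.withSchema(BASE_SCHEMA).addValues(input.lat, input.lng).build();
      }

      @Override public Location toInputType(Row base) {
        return new Location(base.getDouble("lat"), base.getDouble("long"));
      }
    }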
> I would argue that defining the logical type *has* changed the semantic meaning. Before, you just had a tuple of floating point numbers that happened to have fields named lat and long. After defining a logical type, you have a location, and lat and long have a restricted range.

>>>> Another reason I don't like this is that the pattern that we have encouraged (and already use extensively) is to use logical types when converting from external formats such as avro, proto, kryo, etc. If logical types aren't supported by SQL, then we are effectively saying that data coming from these formats can't be easily analyzed with SQL (or SQL will have to special-case the logical types used by every single such input format, which does not seem sustainable).
>>>>
>>>> I think the problem you ran into is that today an aggregation deals with logical types in an all-or-nothing manner. You either aggregate everything by the representation type or everything by the logical type. If we instead allowed SQL to specify well-known logical types (e.g. Instant), I think that this problem would go away (see the sketch after the references below).
>>>>
>>>>> If anyone has any concerns with these points, could you raise them here? Otherwise I'd like to move ahead with the PR. (Sorry this message turned out to be a bit of a novel. LMK if it would be better served by separate threads, or maybe a doc.)
>>>>>
>>>>> Thanks!
>>>>> Brian
>>>>>
>>>>> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>>> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>>> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>>> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>>> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>>> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
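As a sketch of the well-known-logical-types idea above: SQL's type mapping could branch on a logical type's identifier, keeping the input type for identifiers it recognizes and demoting only the rest. The identifier string here is illustrative, not necessarily the real URN.

    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.schemas.Schema.TypeName;

    class SqlTypeView {
      // Illustrative identifier for the MillisInstant logical type.
      static final String MILLIS_INSTANT = "beam:logical_type:millis_instant:v1";

      /** Decide how SQL should see a field type. */
      static FieldType sqlViewOf(FieldType type) {
        if (type.getTypeName() != TypeName.LOGICAL_TYPE) {
          return type; // primitives map directly to SQL types
        }
        String id = type.getLogicalType().getIdentifier();
        if (MILLIS_INSTANT.equals(id)) {
          // Well-known: SQL keeps the logical type and operates on its input
          // type (org.joda.time.Instant), as BeamAggregationRel does after the PR.
          return type;
        }
        // Unknown: demote to the representation (base) type.
        return type.getLogicalType().getBaseType();
      }
    }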