My understanding is that the base type is effectively the wire format for the type, while the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the in-memory representation is a LocalDate struct. There is a performance penalty to this conversion, and I think there are cases where users (SQL, for example) would want a fast path that skips the serialization/deserialization steps when possible.
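To make that base-vs-input distinction concrete, here is a minimal, self-contained sketch (not Beam's actual LogicalType API; the class and method names are made up for illustration) of a Date-style logical type whose base type is a Long of epoch days and whose input type is java.time.LocalDate:

```java
import java.time.LocalDate;

// Hypothetical sketch of a Date-style logical type (not Beam's actual API):
// the base type is a long (days since the epoch, the "wire" representation),
// and the input type is java.time.LocalDate (the in-memory representation).
public class DateLogicalTypeSketch {
    // Convert the in-memory value to the base (wire) representation.
    static long toBaseType(LocalDate input) {
        return input.toEpochDay();
    }

    // Convert the base representation back to the in-memory value.
    // This round trip is the conversion a "fast path" (e.g. SQL) might
    // want to skip by operating on the long directly.
    static LocalDate toInputType(long base) {
        return LocalDate.ofEpochDay(base);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2020, 5, 15);
        long base = toBaseType(date);
        System.out.println(base);              // 18397
        System.out.println(toInputType(base)); // 2020-05-15
    }
}
```

The penalty I mean is the object allocation and field shuffling in toInputType; a consumer that understands the encoding could aggregate or compare the longs directly.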
As for more complex logical types, I would urge caution in making assumptions about the base types. I expect there will be base types of strings containing JSON, or byte arrays containing Protobuf or Avro. ZetaSQL has a civil time module that packs a struct into a Long. All of these are things you can write SQL around, but I think users might be somewhat surprised when these values come out less than usable by default.

On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:

>
> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:
>
>> When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around a couple of points: 1) at the very least, DATETIME is a poor name and should be renamed to INSTANT or TIMESTAMP, and 2) a logical type is better suited for this purpose.
>>
>> Since then, it's been my intention to replace Java's DATETIME with a MillisInstant logical type backed by Long milliseconds since the epoch (see BEAM-7554 [3]). I've finally managed to get a PR together that does so (and keeps tests passing): https://github.com/apache/beam/pull/11456
>>
>> There could be a couple of concerns with this PR that I think would be better discussed here rather than on github.
>>
>> ## Breaking changes in low-level APIs
>> The PR includes some breaking changes in public, low-level schema APIs: it removes DATETIME from the TypeName enum [4], and it will also change the type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long milliseconds since epoch rather than org.joda.time.Instant). Both of these changes have the potential to break user code. That being said, I think the risk is justified for a few reasons:
>> - These are lower-level APIs that we don't intend for most users to use.
>> The preferred higher-level APIs (SQL, Schema transforms, and inferred schemas for user types) should seamlessly transition over to the new MillisInstant logical type.
>> - Schemas are an experimental feature.
>> - We can clearly document this breaking change in the release that includes it.
>
> I am a bit worried about this. Schemas have been marked experimental for almost two years, and I've found that many Beam users are using them. If we make a breaking change, can we make sure that it is extremely clear to users how to fix their code? I'm afraid we are now in a situation where the code is theoretically experimental but practically is widely used.
>
>> ## Mixing joda and java 8 time
>> The NanosInstant logical type that Alex added uses java.time.Instant as its input type, while my MillisInstant type uses org.joda.time.Instant for compatibility with the rest of Beam and the previous DATETIME primitive type. It feels weird, but it also makes a certain sort of sense to use joda time (which has millisecond precision) for MillisInstant, and java 8 time (which has nanos) for NanosInstant. Also, the choice shouldn't have a significant effect on end users - the schema inference code could generate conversions between java 8 time and joda time (as we already do for converting between various joda time types [5]) so user types can use either one.
>
> +1
>
>> ## Arbitrary Logical Types and SQL
>> Previously, much of the SQL code was written to operate on the _base type_ value for any logical type. So for the new MillisInstant type, SQL would attempt to operate on the underlying Long rather than on an org.joda.time.Instant instance. Thus, when I switched over to a MillisInstant logical type as the default for SQL date and time types, any tests that used them began failing with ClassCastExceptions.
>>
>> My solution was just to update the SQL code to only ever reference the input type (i.e. org.joda.time.Instant) for logical types. A good example of this type of change is in BeamAggregationRel [6].
>>
>> My change does pass all of the SQL tests (including internal Google tests), but I'm a little concerned that using the base type throughout SQL was a conscious decision that I'm stomping on. In theory, it could ensure that SqlTransform would be able to operate on user types that contain custom user logical types, without requiring SqlTransform to understand those types. This is just speculation though; there aren't actually any tests verifying that functionality. Also, I don't think any such tests would have worked, since there are several places where we explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
>>
>> Personally, I'm OK with this move, because I'm against the idea of implicitly stripping away logical types like this in schema-aware transforms. A logical type indicates that the value has some additional meaning beyond just the base type (and maybe a restricted range), and I don't think transforms should be able to ignore that meaning unless the user explicitly allows it, or first converts to the base type themselves.
>
> I disagree with this personally:
>
> Arguably, _every_ schema has "additional meaning" beyond the fields, and a LogicalType is just a nice way of naming that and allowing the use of other classes to manage that type. For example, let's say you have a record containing latitude and longitude fields - today you can easily write SQL over these fields. Later you decide that this is a very common pattern in your data and decide to create a Location logical type; this also lets you represent it with a convenience Location class that has many other helpful features (distance functions, builders, etc.).
> It seems unfortunate if you suddenly lose the ability to easily run SQL over this type. Introducing a logical type hasn't changed the semantic meaning of your data at all. All it has done is let you name it, and it would be unfortunate if analyzing your data stream suddenly became harder as a result.
>
> Another reason I don't like this is that the pattern that we have encouraged (and already use extensively) is to use logical types when converting from external formats such as avro, proto, kryo, etc. If logical types aren't supported by SQL, then we are effectively saying that data coming from these formats can't be easily analyzed with SQL (or SQL will have to special-case the logical types used by every single such input format, which does not seem sustainable).
>
> I think the problem you ran into is that today an aggregation deals with logical types in an all-or-nothing manner. You either aggregate everything by the representation type or everything by the logical type. If we instead allowed SQL to specify well-known logical types (e.g. Instant), I think this problem would go away.
>
>> If anyone has any concerns with these points, could you raise them here? Otherwise I'd like to move ahead with the PR.
>> (Sorry this message turned out to be a bit of a novel. LMK if it would be better served by separate threads, or maybe a doc.)
>>
>> Thanks!
>> Brian
>>
>> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>
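P.S. To illustrate the civil-time packing I mentioned at the top: one way to pack date/time fields into a single long looks like the sketch below. The bit widths here are my own assumptions for illustration, not ZetaSQL's actual encoding; the point is only that a SQL engine seeing just the base type would get an opaque long.

```java
// Illustrative bit-packing of civil time into a long, in the spirit of
// ZetaSQL's civil time module. Field widths are assumptions, not
// ZetaSQL's actual layout.
public class CivilTimePackingSketch {
    // Layout, low to high bits: second(6) minute(6) hour(5) day(5) month(4) year(rest).
    static long pack(int year, int month, int day, int hour, int minute, int second) {
        return ((long) year << 26)
                | ((long) month << 22)
                | ((long) day << 17)
                | ((long) hour << 12)
                | ((long) minute << 6)
                | (long) second;
    }

    static int[] unpack(long packed) {
        return new int[] {
                (int) (packed >>> 26),           // year
                (int) ((packed >>> 22) & 0xF),   // month
                (int) ((packed >>> 17) & 0x1F),  // day
                (int) ((packed >>> 12) & 0x1F),  // hour
                (int) ((packed >>> 6) & 0x3F),   // minute
                (int) (packed & 0x3F),           // second
        };
    }

    public static void main(String[] args) {
        long packed = pack(2020, 5, 15, 14, 12, 0);
        // Operating on the base type alone hands SQL this opaque long, which
        // is why surfacing the logical type (or a well-known fast path) matters.
        System.out.println(java.util.Arrays.toString(unpack(packed)));
        // prints [2020, 5, 15, 14, 12, 0]
    }
}
```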