On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:
> When we created the portable representation of schemas last summer we
> intentionally did not include DATETIME as a primitive type [1], even though
> it already existed as a primitive type in Java [2]. There seemed to be
> consensus around a couple of points: 1) At the very least DATETIME is a
> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
> logical type is better suited for this purpose.
>
> Since then, it's been my intention to replace Java's DATETIME with a
> MillisInstant logical type backed by Long milliseconds since the epoch (see
> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
> keeps tests passing): https://github.com/apache/beam/pull/11456
>
> There could be a couple of concerns with this PR that I think would be
> better discussed here rather than on github.
>
> ## Breaking changes in low-level APIs
> The PR includes some breaking changes in public, low-level schema APIs: it
> removes DATETIME from the TypeName enum [4], and it will also change the
> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
> milliseconds since the epoch rather than org.joda.time.Instant). Both of
> these changes have the potential to break user code. That being said, I
> think the risk is justified for a few reasons:
> - These are lower-level APIs that we don't intend for most users to use.
> The preferred higher-level APIs (SQL, schema transforms, and inferred
> schemas for user types) should seamlessly transition over to the new
> MillisInstant logical type.
> - Schemas are an experimental feature.
> - We can clearly document this breaking change in the release that
> includes it.

I am a bit worried about this. Schemas have been marked experimental for
almost two years, and I've found that many Beam users are using them. If we
make a breaking change, can we make sure that it is extremely clear to users
how to fix their code?
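For illustration, the getBaseValue change described above amounts to something like the following sketch. It uses java.time stand-ins rather than the actual Beam API (the real field type is org.joda.time.Instant, and the sample value is arbitrary):

```java
import java.time.Instant;

public class MillisInstantSketch {
    public static void main(String[] args) {
        // Hypothetical: what Row#getBaseValue would now hand back for a
        // MillisInstant field -- plain millis since the epoch, not an Instant.
        long baseValue = 1587747360000L;

        // User code that previously received an Instant directly would now
        // reconstruct one from the millis (sketched with java.time here;
        // the real field uses org.joda.time.Instant).
        Instant instant = Instant.ofEpochMilli(baseValue);

        // Round-trips without loss, since the base type is millis-precision.
        long roundTripped = instant.toEpochMilli();
        System.out.println(baseValue == roundTripped); // true
    }
}
```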
I'm afraid we are in the situation now where the code is theoretically
experimental but practically is widely used.

> ## Mixing joda and java 8 time
> The NanosInstant logical type that Alex added uses java.time.Instant as
> its input type, while my MillisInstant type uses org.joda.time.Instant for
> compatibility with the rest of Beam and the previous DATETIME primitive
> type. It feels weird, but it also makes a certain sort of sense to use joda
> time (which has millisecond precision) for MillisInstant, and java 8 time
> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
> significant effect on end users - the schema inference code could generate
> conversions between java 8 time and joda time (as we already do for
> converting between various joda time types [5]) so user types can use
> either one.

+1

> ## Arbitrary Logical Types and SQL
> Previously much of the SQL code was written to operate on the _base type_
> value for any logical type. So for the new MillisInstant type, SQL would
> attempt to operate on the underlying Long rather than on an
> org.joda.time.Instant instance. Thus when I switched over to the
> MillisInstant logical type as the default for SQL date and time types, any
> tests that used them began failing with ClassCastExceptions.
>
> My solution was just to update the SQL code to only ever reference the
> input type (i.e. org.joda.time.Instant) for logical types. A good example
> of this type of change is in BeamAggregationRel [6].
>
> My change does pass all of the SQL tests (including internal Google
> tests), but I'm a little concerned that using the base type throughout SQL
> was a conscious decision that I'm stomping on. In theory, it could ensure
> that SqlTransform would be able to operate on user types that contain
> custom user logical types, without requiring SqlTransform to understand
> those types. This is just speculation, though; there aren't actually any
> tests verifying that functionality.
> Also, I don't think any such tests would have worked, since there are
> several places where we explicitly check for particular logical types
> (e.g. in BeamCalcRel [7]).
>
> Personally I'm ok with this move, because I'm against the idea of
> implicitly stripping away logical types like this in schema-aware
> transforms. A logical type indicates that the value has some additional
> meaning beyond just the base type (and maybe a restricted range), and I
> don't think transforms should be able to ignore that meaning unless the
> user explicitly allows it, or first converts to the base type themselves.

I disagree with this personally: arguably, _every_ schema has "additional
meaning" beyond the fields, and a LogicalType is just a nice way of naming
that and allowing the use of other classes to manage that type. For example,
let's say you have a record containing latitude and longitude fields - today
you can easily write SQL over these fields. Later you decide that this is a
very common pattern in your data and create a Location logical type; this
also lets you represent it with a convenience Location class that has many
other helpful features (distance functions, builders, etc.). It seems
unfortunate if you suddenly lose the ability to easily run SQL over this
type. Introducing a logical type hasn't changed the semantic meaning of your
data at all. All it has done is let you name it, and it would be unfortunate
if analyzing your data stream suddenly became harder as a result.

Another reason I don't like this is that the pattern we have encouraged (and
already use extensively) is to use logical types when converting from
external formats such as Avro, proto, Kryo, etc. If logical types aren't
supported by SQL, then we are effectively saying that data coming from these
formats can't be easily analyzed with SQL (or SQL will have to special-case
the logical types used by every single such input format, which does not
seem sustainable).
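To make the latitude/longitude example concrete, here is a minimal sketch of the hypothetical Location logical type: the logical value is a Location object with convenience behavior, while the base representation is just the two double fields that SQL already knows how to handle. (All names here are illustrative; the real Beam LogicalType interface has toBaseType/toInputType methods playing the role of toBase/fromBase below.)

```java
public class LocationSketch {
    static final class Location {
        final double latitude;
        final double longitude;

        Location(double latitude, double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }

        // Convenience behavior that motivates having a logical type at all:
        // e.g. great-circle distance in kilometers (haversine formula).
        double distanceKm(Location other) {
            double r = 6371.0; // mean Earth radius in km
            double dLat = Math.toRadians(other.latitude - latitude);
            double dLon = Math.toRadians(other.longitude - longitude);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                    + Math.cos(Math.toRadians(latitude))
                    * Math.cos(Math.toRadians(other.latitude))
                    * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * r * Math.asin(Math.sqrt(a));
        }

        // Stand-ins for the logical type's conversions to and from its
        // base representation (a latitude/longitude pair of doubles).
        double[] toBase() { return new double[] {latitude, longitude}; }
        static Location fromBase(double[] base) { return new Location(base[0], base[1]); }
    }

    public static void main(String[] args) {
        Location sf = new Location(37.7749, -122.4194);
        Location la = new Location(34.0522, -118.2437);
        // SF to LA is roughly 560 km by great-circle distance.
        System.out.println(sf.distanceKm(la) > 500 && sf.distanceKm(la) < 600); // true
    }
}
```

The point of the argument above is that SQL can always fall through to the toBase() pair of doubles - naming that pair as a Location shouldn't take that ability away.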
I think the problem you ran into is that today an aggregation deals with
logical types in an all-or-nothing manner. You either aggregate everything
by the representation type or everything by the logical type. If we instead
allowed SQL to specify well-known logical types (e.g. Instant), I think that
this problem would go away.

> If anyone has any concerns with these points could you raise them here?
> Otherwise I'd like to move ahead with the PR.
> (Sorry this message turned out to be a bit of a novel. LMK if it would be
> better served by separate threads, or maybe a doc.)
>
> Thanks!
> Brian
>
> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
> [3] https://issues.apache.org/jira/browse/BEAM-7554
> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
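P.S. Sketching the "well-known logical types" suggestion above: SQL could keep a registry of logical-type identifiers it understands and materialize those as logical values, while everything else falls back to the base representation. (A sketch only - the registry, method names, and the identifier string are all hypothetical stand-ins, and java.time.Instant is used in place of the Joda type for brevity.)

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class WellKnownLogicalTypes {
    // Hypothetical registry: logical-type identifiers SQL understands, each
    // mapped to a conversion from the base value to the logical value.
    static final Map<String, Function<Object, Object>> WELL_KNOWN = new HashMap<>();
    static {
        // Illustrative identifier for the MillisInstant type discussed above.
        WELL_KNOWN.put("beam:logical_type:millis_instant:v1",
                base -> Instant.ofEpochMilli((Long) base));
    }

    static Object sqlValue(String logicalTypeId, Object baseValue) {
        Function<Object, Object> converter = WELL_KNOWN.get(logicalTypeId);
        // Known logical types are materialized; unknown ones keep working
        // against their base representation instead of failing.
        return converter != null ? converter.apply(baseValue) : baseValue;
    }

    public static void main(String[] args) {
        Object known = sqlValue("beam:logical_type:millis_instant:v1", 1587747360000L);
        Object unknown = sqlValue("my:custom:type", 42L);
        System.out.println(known instanceof Instant); // true
        System.out.println(unknown);                  // 42
    }
}
```

This keeps the aggregation decision per type rather than all-or-nothing: Instant gets logical-value treatment, while a custom Location-style type still flows through SQL as its base fields.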