This seems like a good idea. This stuff is all still marked "experimental" for exactly this reason. This is a case where the name fits perfectly. Both SQL and schemas are new and still working towards a form that can be supported indefinitely without layers of workarounds that will never quiesce. I think the user pain here is minimal. I don't think a deprecation cycle is terrifically important, but we could do one release.
As for SQL, I believe the following:
- we have multiple SQL dialects with different type systems
- our SQL planner/runtime also has its own type system
- when you are converting a Beam schema to pass to SqlTransform, you need to provide a conversion from the Beam schema type system to the SQL dialect's type system (sometimes the conversion is a no-op)

So my thinking about the conversion:
- it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE"
- it is OK to convert unknown logical types into their representation type
- coming out the other side of a SQL statement, a user should expect the type to be a row where every field has a type from the SQL dialect; the input types are gone

I've heard some arguments (on list? offline?) that operating directly on the representation of a logical type is wrong, because you can't really know that what you are doing results in a valid value. I agree with that, basically. That's why I frame it as a one-way conversion from the input type to the SQL dialect's type. You aren't operating on the logical type any more. I feel I am probably forgetting some important details about the "logical type inlining" discussion, so others should speak up if they have more to add.

There is also a backwards-compatibility lesson here: don't show users your enums if you can avoid it. If DATETIME had always been accessed via a (trivial) factory method, then changing the implementation would have been backwards compatible.

Kenn

On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:

> When we created the portable representation of schemas last summer, we
> intentionally did not include DATETIME as a primitive type [1], even
> though it already existed as a primitive type in Java [2].
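[The one-way conversion Kenn describes, from a logical type's input type to its representation (base) type, can be sketched roughly as follows. This is a simplified illustration, not the actual Beam `Schema.LogicalType` API; the class and method names here are invented, and plain `java.time.Instant` stands in for `org.joda.time.Instant` so the example is self-contained.]

```java
import java.time.Instant;

// Simplified model (NOT the real Beam API) of a millis-backed timestamp
// logical type: input type is java.time.Instant, representation type is
// Long (milliseconds since the epoch). Once SQL has converted a value to
// the representation, the "instant" meaning is gone - the output row just
// carries a Long, which is why the conversion is best treated as one-way.
public class MillisInstantSketch {
  // input -> representation: sub-millisecond detail is dropped.
  public static long toRepresentation(Instant input) {
    return input.toEpochMilli();
  }

  // representation -> input: only meaningful if you already know the Long
  // really is millis-since-epoch; nothing in the bare Long tells you that.
  public static Instant toInput(long millis) {
    return Instant.ofEpochMilli(millis);
  }

  public static void main(String[] args) {
    Instant t = Instant.parse("2020-04-24T11:56:00Z");
    System.out.println(toRepresentation(t)); // 1587729360000
  }
}
```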
> There seemed to be consensus around a couple of points: 1) At the very
> least, DATETIME is a poor name and it should be renamed to INSTANT or
> TIMESTAMP, and 2) a logical type is better suited for this purpose.
>
> Since then, it's been my intention to replace Java's DATETIME with a
> MillisInstant logical type backed by Long milliseconds since the epoch
> (see BEAM-7554 [3]). I've finally managed to get a PR together that does
> so (and keeps tests passing): https://github.com/apache/beam/pull/11456
>
> There could be a couple of concerns with this PR that I think would be
> better discussed here rather than on GitHub.
>
>
> ## Breaking changes in low-level APIs
> The PR includes some breaking changes in public, low-level schema APIs:
> it removes DATETIME from the TypeName enum [4], and it will also change
> the type returned by Row#getBaseValue for DATETIME/MillisInstant fields
> (Long milliseconds since epoch rather than org.joda.time.Instant). Both
> of these changes have the potential to break user code. That being said,
> I think the risk is justified for a few reasons:
> - These are lower-level APIs that we don't intend for most users to use.
> The preferred higher-level APIs (SQL, Schema transforms, and inferred
> schemas for user types) should seamlessly transition over to the new
> MillisInstant logical type.
> - Schemas are an experimental feature.
> - We can clearly document this breaking change in the release that
> includes it.
>
>
> ## Mixing joda and java 8 time
> The NanosInstant logical type that Alex added uses java.time.Instant as
> its input type, while my MillisInstant type uses org.joda.time.Instant
> for compatibility with the rest of Beam and the previous DATETIME
> primitive type. It feels weird, but it also makes a certain sort of sense
> to use joda time (which has millisecond precision) for MillisInstant, and
> java 8 time (which has nanos) for NanosInstant.
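[The precision trade-off described above can be illustrated with plain `java.time` (used here instead of joda time so the snippet is dependency-free; `org.joda.time.Instant` is millisecond-precision by construction). The method names are invented for illustration: a millis-backed type can only hold whole milliseconds, so nanosecond detail is truncated, while a nanos-backed type keeps it.]

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustration of millis vs. nanos precision, the distinction behind
// MillisInstant vs. NanosInstant. Hypothetical helper names.
public class PrecisionSketch {
  // What a millisecond-backed type can represent: nanos are truncated.
  public static Instant asMillisInstant(Instant i) {
    return i.truncatedTo(ChronoUnit.MILLIS);
  }

  // java.time.Instant already carries nanosecond precision.
  public static Instant asNanosInstant(Instant i) {
    return i;
  }

  public static void main(String[] args) {
    // 1.234567 ms after the epoch.
    Instant fine = Instant.ofEpochSecond(0, 1_234_567);
    System.out.println(asMillisInstant(fine).getNano()); // 1000000
    System.out.println(asNanosInstant(fine).getNano());  // 1234567
  }
}
```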
> Also, the choice shouldn't have a significant effect on end users - the
> schema inference code could generate conversions between java 8 time and
> joda time (as we already do for converting between various joda time
> types [5]), so user types can use either one.
>
>
> ## Arbitrary Logical Types and SQL
> Previously, much of the SQL code was written to operate on the _base
> type_ value for any logical types. So for the new MillisInstant type, SQL
> would attempt to operate on the underlying Long, rather than on an
> org.joda.time.Instant instance. Thus, when I switched over to a
> MillisInstant logical type as the default for SQL date and time types,
> any tests that used them began failing with ClassCastExceptions.
>
> My solution was just to update the SQL code to only ever reference the
> input type (i.e. org.joda.time.Instant) for logical types. A good example
> of this type of change is in BeamAggregationRel [6].
>
> My change does pass all of the SQL tests (including internal Google
> tests), but I'm a little concerned that using the base type throughout
> SQL was a conscious decision that I'm stomping on. In theory, it could
> ensure that SqlTransform would be able to operate on user types that
> contain custom user logical types, without requiring SqlTransform to
> understand those types. This is just speculation though; there aren't
> actually any tests verifying that functionality. Also, I don't think any
> such tests would have worked, since there are several places where we
> explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
>
> Personally, I'm ok with this move, because I'm against the idea of
> implicitly stripping away logical types like this in schema-aware
> transforms.
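[The ClassCastException failure mode described above can be sketched like this. The class and field names are invented (this is not Beam's row code); the point is that code written against the *base* value of a logical-type field (a Long) breaks as soon as rows start carrying the *input* type (an Instant) instead, and the fix is to reference the input type. Plain `java.time.Instant` stands in for `org.joda.time.Instant`.]

```java
import java.time.Instant;

// Hypothetical sketch of the SQL breakage: a row field that used to hold
// the base value (Long millis) now holds the input value (an Instant).
public class BaseVsInputSketch {
  // Pretend row field for a MillisInstant-like logical type.
  static Object fieldValue = Instant.ofEpochMilli(1587729360000L);

  // Old-style code assuming the base type - this is what started failing.
  public static boolean castToLongFails() {
    try {
      Long millis = (Long) fieldValue; // base-type assumption
      return false;
    } catch (ClassCastException e) {
      return true; // the kind of failure the SQL tests hit
    }
  }

  // The fix: reference the input type for logical-type fields.
  public static long readViaInputType() {
    return ((Instant) fieldValue).toEpochMilli();
  }

  public static void main(String[] args) {
    System.out.println(castToLongFails());  // true
    System.out.println(readViaInputType()); // 1587729360000
  }
}
```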
> A logical type indicates that the value has some additional meaning
> beyond just the base type (and maybe a restricted range), and I don't
> think transforms should be able to ignore that meaning unless the user
> explicitly allows it, or first converts to the base type themselves.
>
>
> If anyone has any concerns with these points, could you raise them here?
> Otherwise I'd like to move ahead with the PR.
> (Sorry this message turned out to be a bit of a novel. LMK if it would be
> better served by separate threads, or maybe a doc.)
>
> Thanks!
> Brian
>
> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
> [3] https://issues.apache.org/jira/browse/BEAM-7554
> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
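[The "restricted range" point above can be made concrete with an invented example. "Percent" is a hypothetical logical type (not in Beam) whose base type is int with a valid range of 0-100: a transform that silently operates on the raw base int (say, summing two fields) can produce a value that is no longer a valid Percent, which is exactly why implicitly stripping the logical type is dangerous.]

```java
// Hypothetical "Percent" logical type: base type int, valid range 0-100.
// Operating on the bare base value can silently leave the valid range.
public class PercentSketch {
  public static int toBase(int percent) {
    if (percent < 0 || percent > 100) {
      throw new IllegalArgumentException("not a valid Percent: " + percent);
    }
    return percent;
  }

  public static int fromBase(int base) {
    return toBase(base); // same range check on the way back in
  }

  public static boolean isValidBase(int base) {
    return base >= 0 && base <= 100;
  }

  public static void main(String[] args) {
    int a = toBase(60), b = toBase(70);
    int sum = a + b;                      // fine as plain ints...
    System.out.println(isValidBase(sum)); // false: 130 is not a Percent
  }
}
```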