When we created the portable representation of schemas last summer we
intentionally did not include DATETIME as a primitive type [1], even though
it already existed as a primitive type in Java [2]. There seemed to be
consensus around a couple of points: 1) At the very least DATETIME is a
poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
logical type is better suited for this purpose.

Since then, it's been my intention to replace Java's DATETIME with a
MillisInstant logical type backed by Long milliseconds since the epoch (see
BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
keeps tests passing): https://github.com/apache/beam/pull/11456

There could be a couple of concerns with this PR that I think would be
better discussed here rather than on github.


## Breaking changes in low-level APIs
The PR includes some breaking changes in public, low-level schema APIs: It
removes DATETIME from the TypeName enum [4], and it will also change the
type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
milliseconds since epoch rather than org.joda.time.Instant). Both of these
changes have the potential to break user code. That being said I think the
risk is justified for a few reasons:
- These are lower-level APIs that we don't intend for most users to use.
The preferred higher-level APIs (SQL, Schema transforms, and inferred
schemas for user types) should seamlessly transition over to the new
MillisInstant logical type.
- Schemas are an experimental feature.
- We can clearly document this breaking change in the release that includes
it.


## Mixing joda and java 8 time
The NanosInstant logical type that Alex added uses java.time.Instant as
it's input type, while my MillisInstant type uses org.joda.time.Instant for
compatibility with the rest of Beam and the previous DATETIME primitive
type. It feels weird, but it also makes a certain sort of sense to use joda
time (which has millisecond precision) for MillisInstant, and java 8 time
(which has nanos) for NanosInstant. Also, the choice shouldn't have a
significant effect on end users - the schema inference code could generate
conversions between java 8 time and joda time (as we already do for
converting between various joda time types [5]) so user types can use
either one.


## Arbitrary Logical Types and SQL
Previously much of the SQL code was written to operate on the _base type_
value for any logical types. So for the new MillisInstant type, SQL would
attempt to operate on the underlying Long, rather than on a
org.joda.time.Instant instance. Thus when I switched over to a
MillisInstant logical type as the default for SQL date and time types any
tests that used them began failing with ClassCastExceptions.

My solution was just to update SQL code to only ever reference the input
type (i.e. org.joda.time.Instant) for logical types. A good example of this
type of change is in BeamAggregationRel [6].

My change does pass all of the SQL tests (including internal Google tests),
but I'm a little concerned that using the base type throughout SQL was a
conscious decision that I'm stomping on. In theory, it could ensure that
SqlTransform would be able to operate on user types that contain custom
user logical types, without requiring SqlTransform to understand those
types. This is just speculation though, there aren't actually any tests
verifying that functionality. Also, I don't think any such tests would have
worked since there are several places where we explicitly check for
particular logical types (e.g. in BeamCalcRel [7]).

Personally I'm ok with this move, because I'm against the idea of
implicitly stripping away logical types like this in schema-aware
transforms. A logical type indicates that the value has some additional
meaning beyond just the base type (and maybe a restricted range), and I
don't think transforms should be able to ignore that meaning unless the
user explicitly allows it, or first converts to the base type themselves.



If anyone has any concerns with these points could you raise them here?
Otherwise I'd like to move ahead with the PR.
(Sorry this message turned out to be a bit of a novel. LMK if it would be
better served by separate threads, or maybe a doc.)

Thanks!
Brian

[1]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
[2]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
[3] https://issues.apache.org/jira/browse/BEAM-7554
[4]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
[5]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
[6]
https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
[7]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409

Reply via email to