This seems like a good idea. This stuff is all still marked "experimental"
for exactly this reason. This is a case where the name fits perfectly. Both
SQL and schemas are new and still working towards a form that can be
supported indefinitely without layers of workarounds that will never
quiesce. I think the user pain here is minimal. I don't think a deprecation
cycle is terrifically important, but we could do one release.

As for SQL, I believe the following:

 - we have multiple SQL dialects with different type systems
 - our SQL planner/runtime also has its own type system
 - when you are converting a Beam schema to pass to SqlTransform, you need
to provide a conversion from the Beam schema type system to the SQL
dialect's type system (sometimes the conversion is a noop)

So my thinking about the conversion:

 - it is OK to convert known logical types like MillisInstant or
NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
TIME ZONE"
 - it is OK to convert unknown logical types into their representation type
 - coming out the other side of a SQL statement, a user should expect the
type to be a row where every field has a type from the SQL dialect; the
input types are gone
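
To make the first two rules concrete, here is a minimal sketch in Java. The
class name, the logical type identifiers, and the dialect type names are all
illustrative stand-ins, not real Beam or Calcite/ZetaSQL APIs.

```java
import java.util.Map;

// Illustrative one-way conversion from a schema's logical types to a SQL
// dialect's type names. All identifiers here are hypothetical examples.
public class ToDialectType {
  // Known logical types map to a dialect-specific SQL type.
  private static final Map<String, String> KNOWN = Map.of(
      "millis_instant", "TIMESTAMP WITH LOCAL TIME ZONE",  // e.g. Calcite
      "nanos_instant", "TIMESTAMP");                       // e.g. ZetaSQL

  // Unknown logical types fall back to their representation type.
  static String convert(String logicalType, String representationType) {
    return KNOWN.getOrDefault(logicalType, representationType);
  }

  public static void main(String[] args) {
    System.out.println(convert("millis_instant", "BIGINT"));   // known: dialect type
    System.out.println(convert("my_custom_type", "VARCHAR"));  // unknown: representation
  }
}
```

The conversion is deliberately not invertible, which matches the third rule:
once a row has been through SQL, every field carries a dialect type.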

I've heard some arguments (on list? offline?) that operating directly on
the representation of a logical type is wrong, because you can't really
know that what you are doing results in a valid value. I agree with that,
basically. That's why I frame it as a one-way conversion from the input
type to the SQL dialect's type. You aren't operating on the logical type
any more.
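
As a toy illustration of that argument (the "percentage" type is invented,
not a Beam API): suppose a logical type whose representation is an int
restricted to [0, 100]. A transform that only sees the base value can compute
something that is no longer a valid value of the logical type.

```java
// Toy example: operating directly on a logical type's representation can
// silently produce values that are invalid for the logical type.
public class RepresentationHazard {
  // Hypothetical "percentage" logical type: an int restricted to [0, 100].
  static boolean isValidPercentage(int value) {
    return value >= 0 && value <= 100;
  }

  public static void main(String[] args) {
    int base = 60;            // valid representation of 60%
    int doubled = base * 2;   // arithmetic on the bare int is well-defined...
    // ...but the result, 120, is not a valid value of the logical type.
    System.out.println(isValidPercentage(doubled));  // false
  }
}
```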

I feel I am probably forgetting some important details about the "logical
type inlining" discussion, so others should speak up if they have more to
add.

There is also a backwards-compatibility lesson here: don't show users your
enums if you can avoid it. If DATETIME had always been accessed via a
(trivial) factory method, then changing the implementation would have been
backwards compatible.
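
A minimal sketch of that pattern (the classes are simplified stand-ins for
the real Schema API): callers who go through a factory method never see which
implementation backs it, so the implementation can change from a primitive
enum value to a logical type without breaking them.

```java
// Simplified stand-in for the Schema API, to illustrate the point: an exposed
// enum constant is frozen into user code, a factory method is not.
public class FieldTypes {
  interface FieldType {
    String name();
  }

  // Today this might return a primitive DATETIME; tomorrow it can return a
  // MillisInstant logical type. Callers of the factory are unaffected.
  static FieldType datetime() {
    return () -> "MillisInstant";  // current implementation detail
  }

  public static void main(String[] args) {
    // User code depends only on the factory, not on a TypeName enum constant.
    FieldType t = datetime();
    System.out.println(t.name());
  }
}
```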

Kenn

On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:

> When we created the portable representation of schemas last summer we
> intentionally did not include DATETIME as a primitive type [1], even though
> it already existed as a primitive type in Java [2]. There seemed to be
> consensus around a couple of points: 1) At the very least DATETIME is a
> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
> logical type is better suited for this purpose.
>
> Since then, it's been my intention to replace Java's DATETIME with a
> MillisInstant logical type backed by Long milliseconds since the epoch (see
> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
> keeps tests passing): https://github.com/apache/beam/pull/11456
>
> There could be a couple of concerns with this PR that I think would be
> better discussed here rather than on github.
>
>
> ## Breaking changes in low-level APIs
> The PR includes some breaking changes in public, low-level schema APIs: It
> removes DATETIME from the TypeName enum [4], and it will also change the
> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
> milliseconds since epoch rather than org.joda.time.Instant). Both of these
> changes have the potential to break user code. That being said, I think the
> risk is justified for a few reasons:
> - These are lower-level APIs that we don't intend for most users to use.
> The preferred higher-level APIs (SQL, Schema transforms, and inferred
> schemas for user types) should seamlessly transition over to the new
> MillisInstant logical type.
> - Schemas are an experimental feature.
> - We can clearly document this breaking change in the release that
> includes it.
>
>
> ## Mixing joda and java 8 time
> The NanosInstant logical type that Alex added uses java.time.Instant as
> its input type, while my MillisInstant type uses org.joda.time.Instant for
> compatibility with the rest of Beam and the previous DATETIME primitive
> type. It feels weird, but it also makes a certain sort of sense to use joda
> time (which has millisecond precision) for MillisInstant, and java 8 time
> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
> significant effect on end users - the schema inference code could generate
> conversions between java 8 time and joda time (as we already do for
> converting between various joda time types [5]) so user types can use
> either one.
>
>
> ## Arbitrary Logical Types and SQL
> Previously much of the SQL code was written to operate on the _base type_
> value for any logical types. So for the new MillisInstant type, SQL would
> attempt to operate on the underlying Long, rather than on a
> org.joda.time.Instant instance. Thus when I switched over to a
> MillisInstant logical type as the default for SQL date and time types, any
> tests that used them began failing with ClassCastExceptions.
>
> My solution was just to update SQL code to only ever reference the input
> type (i.e. org.joda.time.Instant) for logical types. A good example of this
> type of change is in BeamAggregationRel [6].
>
> My change does pass all of the SQL tests (including internal Google
> tests), but I'm a little concerned that using the base type throughout SQL
> was a conscious decision that I'm stomping on. In theory, it could ensure
> that SqlTransform would be able to operate on user types that contain
> custom user logical types, without requiring SqlTransform to understand
> those types. This is just speculation though, there aren't actually any
> tests verifying that functionality. Also, I don't think any such tests
> would have worked since there are several places where we explicitly check
> for particular logical types (e.g. in BeamCalcRel [7]).
>
> Personally I'm ok with this move, because I'm against the idea of
> implicitly stripping away logical types like this in schema-aware
> transforms. A logical type indicates that the value has some additional
> meaning beyond just the base type (and maybe a restricted range), and I
> don't think transforms should be able to ignore that meaning unless the
> user explicitly allows it, or first converts to the base type themselves.
>
>
>
> If anyone has any concerns with these points could you raise them here?
> Otherwise I'd like to move ahead with the PR.
> (Sorry this message turned out to be a bit of a novel. LMK if it would be
> better served by separate threads, or maybe a doc.)
>
> Thanks!
> Brian
>
> [1]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
> [2]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
> [3] https://issues.apache.org/jira/browse/BEAM-7554
> [4]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
> [5]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
> [6]
> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
> [7]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>
