On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bhule...@google.com> wrote:
> After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion:
>
> > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE"
> > - it is OK to convert unknown logical types into their representation type
> > - coming out the other side of a SQL statement, a user should expect the type to be a row where every field has a type from the SQL dialect; the input types are gone
>
> I think what you're proposing is basically a preprocessing step on SqlTransform that demotes any unrecognized logical types to their representation. Is that a fair characterization?

Exactly.

> I wonder if we could push the demotion further into the transform to enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." - if some logical type SQL knows nothing about is pulled into the SELECT *, it seems reasonable (from a user perspective) to just pass the type through untouched, and automatically demoting it might be surprising.

To me, it seems fancy to pass the type through. And I expect users will soon try to do non-pass-through things involving equality or ordering, where only specialized representations work right. So for the very limited cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to make it happen. If Calcite implicit conversions are a good way to implement this, that is fine by me.

Put another way: if the downstream transforms expect to get the logical type back, won't automatic schema conversion already "just work"?

(I don't have any response to the avro/proto points except I think I agree with both of you.)

Kenn

> What if we had the ability to implicitly convert logical types to their representation? Then the above example would work, because no implicit conversion is necessary, and also relations that introspect the logical type (e.g. projecting f_location.lat) could work by including an implicit conversion. The implicit conversion would still be a one-way operation, but it would only happen when necessary.
>
> I really have no idea how complicated this would be to implement. Would it be feasible? Is it ill-advised?
>
> Brian
>
> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:
>
>> I would not describe the base type as the "wire type." If that were true, then the only base type we should support would be byte array.
>>
>> My simple point is that this is no different than normal schema fields. You will find many normal schemas containing data encoded into other field types. You will also find this in SQL databases. If naming these fields with a logical type causes us to throw up our hands and declare that they can't be introspectable, then our users will likely just avoid logical types and encode this data into their own LONG and STRING fields. It will also mean that SQL will not work well over Avro or Proto input.
>
> Is the Avro/Proto connection really a problem? I would think that we could map most common Avro or Proto types to either primitive types or "standard" logical types which are the same as the ones used by SQL wherever possible, e.g. the NanosInstant and NanosDuration types added for proto.
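For concreteness, the demotion pass discussed above might look something like this sketch against the Java SDK's Schema API. It is illustrative only: isKnownToSql and the identifier it checks are made up, the pass ignores nested ROW/ARRAY/MAP fields, and the Row values themselves would still need the matching toBaseType conversion.

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.schemas.Schema.TypeName;

    public class LogicalTypeDemotion {
      /** Replace unrecognized logical types in a schema with their representation type. */
      public static Schema demoteUnknownLogicalTypes(Schema input) {
        Schema.Builder out = Schema.builder();
        for (Schema.Field field : input.getFields()) {
          FieldType type = field.getType();
          if (type.getTypeName() == TypeName.LOGICAL_TYPE
              && !isKnownToSql(type.getLogicalType().getIdentifier())) {
            // Demote: keep the field name, take the base (representation) type.
            // (A real pass would also recurse into ROW, ARRAY, and MAP fields.)
            out.addField(field.getName(), type.getLogicalType().getBaseType());
          } else {
            out.addField(field.getName(), type);
          }
        }
        return out.build();
      }

      // Hypothetical predicate: whichever identifiers the SQL dialect maps to
      // its own types. The identifier string is illustrative.
      private static boolean isKnownToSql(String identifier) {
        return "beam:logical_type:millis_instant:v1".equals(identifier);
      }
    }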
>> Reuven
>>
>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <apill...@google.com> wrote:
>>
>>> My understanding is that the base type is effectively the wire format of the type, while the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the LocalDate type has a struct as the in-memory representation. There is a performance penalty to this conversion, and I think there are cases where users (SQL for example) would want a fast path that skips the serialization/deserialization steps if possible.
>>>
>>> As for more complex logical types, I would urge caution in making assumptions around the base types. I expect there will be base types of strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL has a civil time module that packs a struct into a Long. All of these are things you can write SQL around, but I think the users might be somewhat surprised when they come out less than usable by default.
>>>
>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>>> When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around a couple of points: 1) at the very least DATETIME is a poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a logical type is better suited for this purpose.
>>>>>
>>>>> Since then, it's been my intention to replace Java's DATETIME with a MillisInstant logical type backed by Long milliseconds since the epoch (see BEAM-7554 [3]). I've finally managed to get a PR together that does so (and keeps tests passing): https://github.com/apache/beam/pull/11456
>>>>>
>>>>> There could be a couple of concerns with this PR that I think would be better discussed here rather than on GitHub.
>>>>>
>>>>> ## Breaking changes in low-level APIs
>>>>> The PR includes some breaking changes in public, low-level schema APIs: it removes DATETIME from the TypeName enum [4], and it will also change the type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long milliseconds since epoch rather than org.joda.time.Instant). Both of these changes have the potential to break user code. That being said, I think the risk is justified for a few reasons:
>>>>> - These are lower-level APIs that we don't intend for most users to use. The preferred higher-level APIs (SQL, schema transforms, and inferred schemas for user types) should seamlessly transition over to the new MillisInstant logical type.
>>>>> - Schemas are an experimental feature.
>>>>> - We can clearly document this breaking change in the release that includes it.
>>>>
>>>> I am a bit worried about this. Schemas have been marked experimental for almost two years, and I've found that many Beam users are using them. If we make a breaking change, can we make sure that it is extremely clear to users how to fix their code? I'm afraid we are now in the situation where the code is theoretically experimental but practically is widely used.
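To illustrate that concern: the Row#getBaseValue change described above would surface in user code roughly as follows. A sketch, assuming a Row with a timestamp field named "ts"; the accessors are the real Row API, but the fix shown is just one plausible migration.

    import org.apache.beam.sdk.values.Row;

    class MillisInstantMigration {
      static void readTimestamp(Row row) {
        // Before the PR (DATETIME primitive): the base value was the Instant itself.
        // org.joda.time.Instant ts = (org.joda.time.Instant) row.getBaseValue("ts");

        // After the PR (MillisInstant logical type): the base value is Long millis
        // since the epoch, so the cast above throws ClassCastException. One fix:
        long millis = (Long) row.getBaseValue("ts");
        org.joda.time.Instant ts = new org.joda.time.Instant(millis);

        // The higher-level accessor still returns the logical (input) type.
        org.joda.time.Instant viaValue = row.getValue("ts");
      }
    }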
>>>>> ## Mixing joda and java 8 time
>>>>> The NanosInstant logical type that Alex added uses java.time.Instant as its input type, while my MillisInstant type uses org.joda.time.Instant for compatibility with the rest of Beam and the previous DATETIME primitive type. It feels weird, but it also makes a certain sort of sense to use joda time (which has millisecond precision) for MillisInstant, and java 8 time (which has nanos) for NanosInstant. Also, the choice shouldn't have a significant effect on end users - the schema inference code could generate conversions between java 8 time and joda time (as we already do for converting between various joda time types [5]) so user types can use either one.
>>>>
>>>> +1
>>>>
>>>>> ## Arbitrary Logical Types and SQL
>>>>> Previously much of the SQL code was written to operate on the _base type_ value for any logical type. So for the new MillisInstant type, SQL would attempt to operate on the underlying Long, rather than on an org.joda.time.Instant instance. Thus when I switched over to a MillisInstant logical type as the default for SQL date and time types, any tests that used them began failing with ClassCastExceptions.
>>>>>
>>>>> My solution was just to update the SQL code to only ever reference the input type (i.e. org.joda.time.Instant) for logical types. A good example of this type of change is in BeamAggregationRel [6].
>>>>>
>>>>> My change does pass all of the SQL tests (including internal Google tests), but I'm a little concerned that using the base type throughout SQL was a conscious decision that I'm stomping on. In theory, it could ensure that SqlTransform would be able to operate on user types that contain custom user logical types, without requiring SqlTransform to understand those types. This is just speculation though; there aren't actually any tests verifying that functionality. Also, I don't think any such tests would have worked, since there are several places where we explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
>>>>>
>>>>> Personally I'm ok with this move, because I'm against the idea of implicitly stripping away logical types like this in schema-aware transforms. A logical type indicates that the value has some additional meaning beyond just the base type (and maybe a restricted range), and I don't think transforms should be able to ignore that meaning unless the user explicitly allows it, or first converts to the base type themselves.
>>>>
>>>> I disagree with this personally. Arguably, _every_ schema has "additional meaning" beyond the fields, and a LogicalType is just a nice way of naming that and allowing the use of other classes to manage that type. For example, let's say you have a record containing latitude and longitude fields - today you can easily write SQL over these fields. Later you decide that this is a very common pattern in your data and decide to create a Location logical type; this also lets you represent this with a convenience Location class that has many other helpful features (distance functions, builders, etc.). It seems unfortunate if you suddenly lose the ability to easily run SQL over this type. Introducing a logical type hasn't changed the semantic meaning of your data at all. All it has done is let you name it, and it would be unfortunate if analyzing your data stream suddenly became harder as a result.
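That example, sketched against Java's Schema.LogicalType interface (the Location class, identifier, and field names are invented here, and the interface's exact method set has varied across Beam versions). The point is that the base type is the same two-double row the data always was:

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.values.Row;

    // Plain user class with whatever convenience methods you like
    // (distance functions, builders, etc.).
    class Location {
      final double lat;
      final double lng;
      Location(double lat, double lng) { this.lat = lat; this.lng = lng; }
    }

    class LocationLogicalType implements Schema.LogicalType<Location, Row> {
      // The base type is exactly the (lat, long) row the user had before.
      static final Schema BASE_SCHEMA =
          Schema.builder().addDoubleField("lat").addDoubleField("long").build();

      @Override public String getIdentifier() { return "example:location:v1"; }
      @Override public FieldType getArgumentType() { return null; } // no argument
      @Override public FieldType getBaseType() { return FieldType.row(BASE_SCHEMA); }

      @Override public Row toBaseType(Location input) {
        return Row.withSchema(BASE_SCHEMA).addValues(input.lat, input.lng).build();
      }

      @Override public Location toInputType(Row base) {
        return new Location(base.getDouble("lat"), base.getDouble("long"));
      }
    }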
> I would argue that defining the logical type *has* changed the semantic meaning. Before, you just had a tuple of floating point numbers that happened to have fields named lat and long. After defining a logical type, you have a location, and lat and long have a restricted range.

>>>> Another reason I don't like this is that the pattern that we have encouraged (and already use extensively) is to use logical types when converting from external formats such as avro, proto, kryo, etc. If logical types aren't supported by SQL, then we are effectively saying that data coming from these formats can't be easily analyzed with SQL (or SQL will have to special-case the logical types used by every single such input format, which does not seem sustainable).
>>>>
>>>> I think the problem you ran into is that today an aggregation deals with logical types in an all-or-nothing manner. You either aggregate everything by the representation type or everything by the logical type. If we instead allowed SQL to specify well-known logical types (e.g. Instant), I think that this problem would go away (see the sketch after the references below).
>>>>
>>>>> If anyone has any concerns with these points, could you raise them here? Otherwise I'd like to move ahead with the PR. (Sorry this message turned out to be a bit of a novel. LMK if it would be better served by separate threads, or maybe a doc.)
>>>>>
>>>>> Thanks!
>>>>> Brian
>>>>>
>>>>> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>>> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>>> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>>> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>>> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>>> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
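As a sketch of the well-known-logical-types idea above: SQL's type mapping could branch on a logical type's identifier, keeping the input type for identifiers it recognizes and demoting only the rest. The identifier string here is illustrative, not necessarily the real URN.

    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.schemas.Schema.TypeName;

    class SqlTypeView {
      // Illustrative identifier for the MillisInstant logical type.
      static final String MILLIS_INSTANT = "beam:logical_type:millis_instant:v1";

      /** Decide how SQL should see a field type. */
      static FieldType sqlViewOf(FieldType type) {
        if (type.getTypeName() != TypeName.LOGICAL_TYPE) {
          return type; // primitives map directly to SQL types
        }
        String id = type.getLogicalType().getIdentifier();
        if (MILLIS_INSTANT.equals(id)) {
          // Well-known: SQL keeps the logical type and operates on its input
          // type (org.joda.time.Instant), as BeamAggregationRel does after the PR.
          return type;
        }
        // Unknown: demote to the representation (base) type.
        return type.getLogicalType().getBaseType();
      }
    }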