On Fri, May 15, 2020 at 10:33 PM Reuven Lax <re...@google.com> wrote:
> On Fri, May 15, 2020 at 8:10 PM Kenneth Knowles <k...@apache.org> wrote:

>> On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bhule...@google.com> wrote:

>>> After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion:

>>> > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE"
>>> > - it is OK to convert unknown logical types into their representation type
>>> > - coming out the other side of a SQL statement, a user should expect the type to be a row where every field has a type from the SQL dialect; the input types are gone

>>> I think what you're proposing is basically a preprocessing step on SqlTransform that demotes any unrecognized logical types to their representation. Is that a fair characterization?

>> Exactly

>>> I wonder if we could push the demotion further into the transform to enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." - if some logical type SQL knows nothing about is pulled into the SELECT *, it seems reasonable (from a user perspective) to just pass the type through untouched, and automatically demoting it might be surprising.

>> To me, it seems fancy to pass the type through. And I expect users will soon try to do non-pass-through things involving equality or ordering, where only specialized representations work right. So for the very limited cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to make it happen. If Calcite implicit conversions are a good way to implement this, that is fine by me.

>> Put another way: if the downstream transforms expect to get the logical type back, won't automatic schema conversion already "just work"?

> No. Let's say you are reading protos and writing protos with a SQL filter statement in between. If we lose the logical type, then the output schema may not be able to match against the proto schema. E.g. if the proto has a field of type FIXED32, we reflect this with a Fixed32 logical type with a base type of integer. SQL can operate on the base type just fine because it's just an int32, but if you strip the logical type from the output then it will have trouble matching against the proto schema, as Beam will infer a plain int32 type.

Makes sense. If we implicitly convert only one way then we create an asymmetry. I missed the fact that we wouldn't coerce the representation type back to the logical type.

Kenn

>> (I don't have any response to the avro/proto points except I think I agree with both of you)

>> Kenn

>>> What if we had the ability to implicitly convert logical types to their representation? Then the above example would work, because no implicit conversion is necessary, and also relations that introspect the logical type (e.g. projecting f_location.lat) could work by including an implicit conversion. The implicit conversion would still be a one-way operation, but it would only happen when necessary.

>>> I really have no idea how complicated this would be to implement. Would it be feasible? Is it ill-advised?

>>> Brian

>>> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:

>>>> I would not describe the base type as the "wire type." If that were true, then the only base type we should support would be byte array.
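To make the "preprocessing step on SqlTransform that demotes unrecognized logical types" idea above concrete, here is a minimal sketch of what such a demotion pass over a schema might look like. The helper name is hypothetical, and nested rows, arrays, and maps are ignored for brevity:

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.Field;
    import org.apache.beam.sdk.schemas.Schema.FieldType;

    /** Hypothetical pre-processing helper: demote logical-type fields to their representation. */
    class DemoteLogicalTypes {
      static Schema demote(Schema schema) {
        Schema.Builder builder = Schema.builder();
        for (Field field : schema.getFields()) {
          FieldType type = field.getType();
          // Keep only the base (representation) type for logical-type fields, e.g. a proto
          // Fixed32 logical type becomes a plain INT32; everything else passes through.
          FieldType demoted =
              type.getLogicalType() != null ? type.getLogicalType().getBaseType() : type;
          builder.addField(field.getName(), demoted);
        }
        return builder.build();
      }
    }

This is exactly the step that loses information on the way out: the demoted output schema can no longer be matched back against the original proto schema in the FIXED32 example above.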
>>>> My simple point is that this is no different than normal schema fields. You will find many normal schemas containing data encoded into other field types. You will also find this in SQL databases. If naming these fields with a logical type causes us to throw up our hands and declare that they can't be introspected, then our users will likely just avoid logical types and encode this data into their own LONG and STRING fields. It will also mean that SQL will not work well over Avro or Proto input.

>>> Is the Avro/Proto connection really a problem? I would think that we could map most common Avro or Proto types to either primitive types or "standard" logical types which are the same as the ones used by SQL wherever possible, e.g. the NanosInstant and NanosDuration types added for proto.

>>>> Reuven

>>>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <apill...@google.com> wrote:

>>>>> My understanding is that the base type is effectively the wire format of the type, while the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the LocalDate type has a struct as the in-memory representation. There is a performance penalty to this conversion, and I think there are cases where users (SQL for example) would want a fast path that skips the serialization/deserialization steps if possible.

>>>>> As for more complex logical types, I would urge caution in making assumptions around the base types. I expect there will be base types of strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL has a civil time module that packs a struct into a Long. All of these are things you can write SQL around, but I think the users might be somewhat surprised when they come out less than usable by default.

>>>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:

>>>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com> wrote:

>>>>>>> When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around a couple of points: 1) at the very least DATETIME is a poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a logical type is better suited for this purpose.

>>>>>>> Since then, it's been my intention to replace Java's DATETIME with a MillisInstant logical type backed by Long milliseconds since the epoch (see BEAM-7554 [3]). I've finally managed to get a PR together that does so (and keeps tests passing): https://github.com/apache/beam/pull/11456

>>>>>>> There could be a couple of concerns with this PR that I think would be better discussed here rather than on github.

>>>>>>> ## Breaking changes in low-level APIs

>>>>>>> The PR includes some breaking changes in public, low-level schema APIs: it removes DATETIME from the TypeName enum [4], and it will also change the type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long milliseconds since epoch rather than org.joda.time.Instant).
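For illustration, here is a minimal sketch of the user-visible difference for code that reaches for the base value directly (the field name "ts" is hypothetical):

    import org.apache.beam.sdk.values.Row;
    import org.joda.time.Instant;

    class TimestampFieldAccess {
      // Reads a timestamp field that was a DATETIME primitive and becomes a
      // MillisInstant logical type after the PR.
      static Instant readTimestamp(Row row) {
        // Before: getBaseValue returned an org.joda.time.Instant directly.
        // After: the base value is the representation type, a Long holding millis
        // since the epoch, so callers that want a joda Instant have to rebuild it.
        Long millis = (Long) row.getBaseValue("ts");
        return new Instant(millis);
      }
    }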
>>>>>>> Both of these changes have the potential to break user code. That being said, I think the risk is justified for a few reasons:
>>>>>>> - These are lower-level APIs that we don't intend for most users to use. The preferred higher-level APIs (SQL, Schema transforms, and inferred schemas for user types) should seamlessly transition over to the new MillisInstant logical type.
>>>>>>> - Schemas are an experimental feature.
>>>>>>> - We can clearly document this breaking change in the release that includes it.

>>>>>> I am a bit worried about this. Schemas have been marked experimental for almost two years, and I've found that many Beam users are using them. If we make a breaking change, can we make sure that it is extremely clear to users how to fix their code? I'm afraid we are now in a situation where the code is theoretically experimental but practically is widely used.

>>>>>>> ## Mixing joda and java 8 time

>>>>>>> The NanosInstant logical type that Alex added uses java.time.Instant as its input type, while my MillisInstant type uses org.joda.time.Instant for compatibility with the rest of Beam and the previous DATETIME primitive type. It feels weird, but it also makes a certain sort of sense to use joda time (which has millisecond precision) for MillisInstant, and java 8 time (which has nanos) for NanosInstant. Also, the choice shouldn't have a significant effect on end users - the schema inference code could generate conversions between java 8 time and joda time (as we already do for converting between various joda time types [5]) so user types can use either one.

>>>>>> +1

>>>>>>> ## Arbitrary Logical Types and SQL

>>>>>>> Previously, much of the SQL code was written to operate on the _base type_ value for any logical types. So for the new MillisInstant type, SQL would attempt to operate on the underlying Long rather than on an org.joda.time.Instant instance. Thus when I switched over to a MillisInstant logical type as the default for SQL date and time types, any tests that used them began failing with ClassCastExceptions.

>>>>>>> My solution was just to update the SQL code to only ever reference the input type (i.e. org.joda.time.Instant) for logical types. A good example of this type of change is in BeamAggregationRel [6].

>>>>>>> My change does pass all of the SQL tests (including internal Google tests), but I'm a little concerned that using the base type throughout SQL was a conscious decision that I'm stomping on. In theory, it could ensure that SqlTransform would be able to operate on user types that contain custom user logical types, without requiring SqlTransform to understand those types. This is just speculation though; there aren't actually any tests verifying that functionality. Also, I don't think any such tests would have worked, since there are several places where we explicitly check for particular logical types (e.g. in BeamCalcRel [7]).
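On the joda / java 8 interoperability point above, the conversions that schema inference could generate amount to a millisecond-epoch round trip. A sketch (class and method names are hypothetical; note that the java 8 to joda direction truncates to millisecond precision):

    import java.time.Instant;

    class TimeInterop {
      // java.time -> joda: truncates nanosecond precision to milliseconds.
      static org.joda.time.Instant toJoda(Instant javaInstant) {
        return new org.joda.time.Instant(javaInstant.toEpochMilli());
      }

      // joda -> java.time: lossless, since joda only carries milliseconds.
      static Instant toJavaTime(org.joda.time.Instant jodaInstant) {
        return Instant.ofEpochMilli(jodaInstant.getMillis());
      }
    }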
>>>>>>> Personally I'm ok with this move, because I'm against the idea of implicitly stripping away logical types like this in schema-aware transforms. A logical type indicates that the value has some additional meaning beyond just the base type (and maybe a restricted range), and I don't think transforms should be able to ignore that meaning unless the user explicitly allows it, or first converts to the base type themselves.

>>>>>> I disagree with this personally:

>>>>>> Arguably, _every_ schema has "additional meaning" beyond the fields, and a LogicalType is just a nice way of naming that and allowing the use of other classes to manage that type. For example, let's say you have a record containing latitude and longitude fields - today you can easily write SQL over these fields. Later you decide that this is a very common pattern in your data and decide to create a Location logical type; this also lets you represent this with a convenience Location class that has many other helpful features (distance functions, builders, etc.). It seems unfortunate if you suddenly lose the ability to easily run SQL over this type. Introducing a logical type hasn't changed the semantic meaning of your data at all. All it has done is let you name it, and it would be unfortunate if analyzing your data stream suddenly became harder as a result.

>>> I would argue that defining the logical type *has* changed the semantic meaning. Before you just had a tuple of floating point numbers that happened to have fields named lat and long. After defining a logical type you have a location, and lat and long have a restricted range.

>>>>>> Another reason I don't like this is that the pattern we have encouraged (and already use extensively) is to use logical types when converting from external formats such as avro, proto, kryo, etc. If logical types aren't supported by SQL, then we are effectively saying that data coming from these formats can't be easily analyzed with SQL (or SQL will have to special-case the logical types used by every single such input format, which does not seem sustainable).

>>>>>> I think the problem you ran into is that today an aggregation deals with logical types in an all-or-nothing manner. You either aggregate everything by the representation type or everything by the logical type. If we instead allowed SQL to specify well-known logical types (e.g. Instant), I think that this problem would go away.
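As a concrete reference point for the latitude/longitude discussion above, a Location logical type over a ROW base type might look roughly like this. The class, identifier, and field names are illustrative, and the exact Schema.LogicalType method set is assumed from the Java SDK of the time:

    import java.io.Serializable;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.values.Row;

    /** Hypothetical Location logical type over a ROW{lat: DOUBLE, lng: DOUBLE} base type. */
    class LocationLogicalType implements Schema.LogicalType<LocationLogicalType.Location, Row> {

      /** The convenience class users would work with directly. */
      static class Location implements Serializable {
        final double lat;
        final double lng;
        Location(double lat, double lng) {
          this.lat = lat;
          this.lng = lng;
        }
      }

      private static final Schema BASE_SCHEMA =
          Schema.builder().addDoubleField("lat").addDoubleField("lng").build();

      @Override
      public String getIdentifier() {
        return "example:location"; // identifier is illustrative
      }

      // No type argument for this logical type.
      public FieldType getArgumentType() {
        return null;
      }

      @Override
      public FieldType getBaseType() {
        return FieldType.row(BASE_SCHEMA);
      }

      @Override
      public Row toBaseType(Location input) {
        return Row.withSchema(BASE_SCHEMA).addValues(input.lat, input.lng).build();
      }

      @Override
      public Location toInputType(Row base) {
        return new Location(base.getDouble("lat"), base.getDouble("lng"));
      }
    }

Whether SQL then operates on the base Row (lat/lng fields) or on the Location input type is exactly the question being debated in this thread.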
>>>>>>> If anyone has any concerns with these points, could you raise them here? Otherwise I'd like to move ahead with the PR.

>>>>>>> (Sorry this message turned out to be a bit of a novel. LMK if it would be better served by separate threads, or maybe a doc.)

>>>>>>> Thanks!
>>>>>>> Brian

>>>>>>> [1] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>>>>> [2] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>>>>> [4] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>>>>> [5] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>>>>> [6] https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>>>>> [7] https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409