My understanding is that the base type is effectively the wire format of the
type, while the logical type's input type is the in-memory representation for
Java. For org.joda.time.Instant, this is just a wrapper around the underlying
Long. However, for the Date logical type, the in-memory representation is the
LocalDate type, which is a struct. There is a performance penalty to this
conversion, and I think there are cases where users (SQL, for example) would
want a fast path that skips the serialization/deserialization steps if
possible.
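As a concrete sketch of that split (all names hypothetical, and java.time.Instant standing in for joda so the example is self-contained): the base type is the Long on the wire, the input type is the Instant in memory, and every field access through the logical type pays one of these conversions:

```java
import java.time.Instant;

// Minimal sketch (not the real Beam API) of the base-type/input-type split:
// the base type is the wire-level value, the input type is what Java code
// works with. Names here are illustrative only.
public class MillisInstantSketch {
    // toBaseType: in-memory representation -> wire representation
    public static long toBaseType(Instant input) {
        return input.toEpochMilli();
    }

    // toInputType: wire representation -> in-memory representation
    public static Instant toInputType(long baseValue) {
        return Instant.ofEpochMilli(baseValue);
    }
}
```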

As for more complex logical types, I would urge caution in making
assumptions about the base types. I expect there will be base types of
strings containing JSON, or byte arrays containing protobuf or avro. ZetaSQL
has a civil time module that packs a struct into a Long. All of these are
things you can write SQL around, but I think users might be somewhat
surprised when they come out less than usable by default.
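To make the civil-time point concrete, here is an illustrative bit-field packing of a civil datetime into a single long. This is in the spirit of (but not bit-identical to) ZetaSQL's encoding; it shows why such a base value is opaque to generic SQL code that only sees a Long:

```java
// Illustrative only: pack a civil datetime into one long via bit fields.
// Layout (low to high bits): second:6, minute:6, hour:5, day:5, month:4, year:rest.
public class CivilTimePacking {
    public static long pack(int year, int month, int day, int hour, int minute, int second) {
        return ((long) year << 26) | (month << 22) | (day << 17)
                | (hour << 12) | (minute << 6) | second;
    }

    public static int second(long packed) { return (int) (packed & 0x3F); }
    public static int minute(long packed) { return (int) ((packed >> 6) & 0x3F); }
    public static int hour(long packed)   { return (int) ((packed >> 12) & 0x1F); }
    public static int day(long packed)    { return (int) ((packed >> 17) & 0x1F); }
    public static int month(long packed)  { return (int) ((packed >> 22) & 0xF); }
    public static int year(long packed)   { return (int) (packed >> 26); }
}
```

A SQL engine handed this Long would happily SUM or AVG it, producing nonsense, which is exactly the "less than usable by default" failure mode.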

On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:

>
>
> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bhule...@google.com>
> wrote:
>
>> When we created the portable representation of schemas last summer we
>> intentionally did not include DATETIME as a primitive type [1], even though
>> it already existed as a primitive type in Java [2]. There seemed to be
>> consensus around a couple of points: 1) At the very least DATETIME is a
>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>> logical type is better suited for this purpose.
>>
>> Since then, it's been my intention to replace Java's DATETIME with a
>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>
>> There could be a couple of concerns with this PR that I think would be
>> better discussed here rather than on github.
>>
>>
>> ## Breaking changes in low-level APIs
>> The PR includes some breaking changes in public, low-level schema APIs:
>> It removes DATETIME from the TypeName enum [4], and it will also change the
>> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
>> milliseconds since epoch rather than org.joda.time.Instant). Both of these
>> changes have the potential to break user code. That being said I think the
>> risk is justified for a few reasons:
>> - These are lower-level APIs that we don't intend for most users to use.
>> The preferred higher-level APIs (SQL, Schema transforms, and inferred
>> schemas for user types) should seamlessly transition over to the new
>> MillisInstant logical type.
>> - Schemas are an experimental feature.
>> - We can clearly document this breaking change in the release that
>> includes it.
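For readers hit by the second breaking change above, the fix is mechanical: convert the Long base value back to an instant explicitly. A sketch with hypothetical accessor names, using java.time in place of joda time so it stands alone:

```java
import java.time.Instant;

// Hypothetical sketch of the API change described above: getBaseValue for a
// DATETIME/MillisInstant field now yields Long epoch millis instead of an
// Instant, so callers convert explicitly. (Real code would go through Beam's
// Row; java.time.Instant stands in for org.joda.time.Instant here.)
public class GetBaseValueMigration {
    // Before: Object base = row.getBaseValue("ts");  // was an Instant
    // After:  the base value is the Long wire representation.
    public static Instant fromBaseValue(Object baseValue) {
        return Instant.ofEpochMilli((Long) baseValue);
    }
}
```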
>>
>
> I am a bit worried about this. Schemas have been marked experimental for
> almost two years, and I've found that many Beam users are using them. If we
> make a breaking change, can we make sure that it is extremely clear to
> users how to fix their code? I'm afraid we are in the situation now where
> the code is theoretically experimental but practically is widely used.
>
>
>
>>
>>
>> ## Mixing joda and java 8 time
>> The NanosInstant logical type that Alex added uses java.time.Instant as
>> its input type, while my MillisInstant type uses org.joda.time.Instant for
>> compatibility with the rest of Beam and the previous DATETIME primitive
>> type. It feels weird, but it also makes a certain sort of sense to use joda
>> time (which has millisecond precision) for MillisInstant, and java 8 time
>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>> significant effect on end users - the schema inference code could generate
>> conversions between java 8 time and joda time (as we already do for
>> converting between various joda time types [5]) so user types can use
>> either one.
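A sketch of the bridging involved (java.time only, since joda isn't available here): joda and java 8 instants can round-trip losslessly through epoch millis, while nanos-precision values are truncated when forced into millis precision, which is the one lossy direction to be aware of:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Sketch of the kind of conversion schema inference could generate between
// the two time libraries: both sides agree on epoch millis, so bridging a
// millis-precision instant is lossless, while a nanos-precision instant is
// truncated. java.time.Instant stands in for both libraries here.
public class TimeBridging {
    // Analogue of org.joda.time.Instant <-> java.time.Instant bridging.
    public static long toMillis(Instant t) {
        return t.toEpochMilli();
    }

    public static Instant truncateToMillis(Instant t) {
        return t.truncatedTo(ChronoUnit.MILLIS);
    }
}
```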
>>
>
> +1
>
>
>>
>>
>> ## Arbitrary Logical Types and SQL
>> Previously much of the SQL code was written to operate on the _base type_
>> value for any logical types. So for the new MillisInstant type, SQL would
>> attempt to operate on the underlying Long, rather than on a
>> org.joda.time.Instant instance. Thus when I switched over to a
>> MillisInstant logical type as the default for SQL date and time types any
>> tests that used them began failing with ClassCastExceptions.
>>
>> My solution was just to update SQL code to only ever reference the input
>> type (i.e. org.joda.time.Instant) for logical types. A good example of this
>> type of change is in BeamAggregationRel [6].
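The ClassCastExceptions mentioned above reduce to a small pattern. A minimal, illustrative reproduction (not Beam's actual code), with java.time.Instant standing in for joda:

```java
import java.time.Instant;

// Minimal reproduction of the failure mode described above: code written
// against the base type blindly casts the field value, and throws once the
// row actually holds the logical (input-type) value. Names are illustrative.
public class BaseTypeCast {
    // Pre-fix SQL-style access: assumes the stored value is the Long base type.
    public static long readAsBaseType(Object fieldValue) {
        return (Long) fieldValue;  // ClassCastException if an Instant is stored
    }

    // Post-fix access: reference the input type instead.
    public static Instant readAsInputType(Object fieldValue) {
        return (Instant) fieldValue;
    }
}
```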
>>
>> My change does pass all of the SQL tests (including internal Google
>> tests), but I'm a little concerned that using the base type throughout SQL
>> was a conscious decision that I'm stomping on. In theory, it could ensure
>> that SqlTransform would be able to operate on user types that contain
>> custom user logical types, without requiring SqlTransform to understand
>> those types. This is just speculation though, there aren't actually any
>> tests verifying that functionality. Also, I don't think any such tests
>> would have worked since there are several places where we explicitly check
>> for particular logical types (e.g. in BeamCalcRel [7]).
>>
>> Personally I'm ok with this move, because I'm against the idea of
>> implicitly stripping away logical types like this in schema-aware
>> transforms. A logical type indicates that the value has some additional
>> meaning beyond just the base type (and maybe a restricted range), and I
>> don't think transforms should be able to ignore that meaning unless the
>> user explicitly allows it, or first converts to the base type themselves.
>>
>
> I disagree with this personally:
>
> Arguably, _every_ schema has "additional meaning" beyond the fields, and a
> LogicalType is just a nice way of naming that and allowing the use of other
> classes to manage that type. For example, let's say you have a record
> containing latitude and longitude fields - today you can easily write SQL
> over these fields. Later you decide that this is a very common pattern in
> your data and decide to create a Location logical type; this also lets you
> represent this with a convenience Location class that has many other
> helpful features (distance functions, builders, etc.). It seems unfortunate
> if you suddenly lose the ability to easily run SQL over this type.
> Introducing a logical type hasn't changed the semantic meaning of your data
> at all. All it has done is let you name it, and it would be unfortunate if
> analyzing your data stream suddenly became harder as a result.
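The Location example above might look like this. A hypothetical sketch: the underlying data is still the same two doubles SQL already understands, and the named type only adds convenience methods on top:

```java
// Illustrative sketch of the Location example: a plain lat/lng record gains
// a named type with helper methods without changing the underlying
// two-double representation. All names here are hypothetical.
public class Location {
    public final double latitude;
    public final double longitude;

    public Location(double latitude, double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    // Great-circle distance in kilometers (haversine formula) -- the kind of
    // "helpful feature" a convenience class adds over the raw fields.
    public double distanceKm(Location other) {
        double r = 6371.0;  // mean Earth radius, km
        double dLat = Math.toRadians(other.latitude - latitude);
        double dLng = Math.toRadians(other.longitude - longitude);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(latitude)) * Math.cos(Math.toRadians(other.latitude))
                * Math.pow(Math.sin(dLng / 2), 2);
        return 2 * r * Math.asin(Math.sqrt(a));
    }
}
```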
>
> Another reason I don't like this, is that the pattern that we have
> encouraged (and already use extensively) is to use logical types when
> converting from external formats such as avro, proto, kryo, etc.  If
> logical types aren't supported by SQL, then we are effectively saying that
> data coming from these formats can't be easily analyzed with SQL (or SQL
> will have to special case the logical types used by every single such input
> format, which does not seem sustainable).
>
> I think the problem you ran into is that today an aggregation deals with
> logical types in an all-or-nothing manner. You either aggregate everything
> by the representation type or everything by the logical type. If we instead
> allowed SQL to specify well-known logical types (e.g. Instant), I think
> that this problem would go away.
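One way to sketch that per-type (rather than all-or-nothing) approach: SQL registers conversions for the logical-type identifiers it understands and falls back to the base value for everything else. The identifier string and types below are illustrative assumptions, not Beam's real registry:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of per-logical-type handling: well-known logical types surface
// their input type to SQL, unknown ones fall back to the base representation.
// The identifier string here is an assumption for illustration.
public class LogicalTypeDispatch {
    private static final Map<String, Function<Object, Object>> WELL_KNOWN = new HashMap<>();
    static {
        // SQL knows MillisInstant: surface the input type instead of the Long.
        WELL_KNOWN.put("beam:logical_type:millis_instant:v1",
                base -> java.time.Instant.ofEpochMilli((Long) base));
    }

    public static Object resolve(String logicalTypeId, Object baseValue) {
        // Unknown logical types keep their base representation.
        return WELL_KNOWN.getOrDefault(logicalTypeId, b -> b).apply(baseValue);
    }
}
```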
>
>
>>
>>
>> If anyone has any concerns with these points could you raise them here?
>> Otherwise I'd like to move ahead with the PR.
>> (Sorry this message turned out to be a bit of a novel. LMK if it would be
>> better served by separate threads, or maybe a doc.)
>>
>> Thanks!
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>> [2]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>> [4]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>> [5]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>> [6]
>> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>> [7]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>>
>
