Re: [DISCUSS] More precision supported by DATETIME field in Schema

Rui Wang Thu, 08 Nov 2018 15:33:35 -0800

https://github.com/apache/beam/pull/6991


I am using java.time.Instant as the internal representation to replace Joda
time for the DATETIME field in the PR. java.time.Instant uses a *long* to
store seconds-after-epoch and an *int* to store nanoseconds-of-second.
Therefore the full 64 bits are available for seconds-after-epoch, so no
range is lost.
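
As a minimal sketch of that layout, using only the standard java.time API:
the class and helper names below are hypothetical and are not taken from the
PR. It shows the two components Instant exposes, what is dropped when the
value is narrowed to milliseconds, and, for comparison, how many years a
single signed 64-bit count of nanoseconds can cover on one side of the epoch.

```java
import java.time.Instant;

// Hypothetical illustration class; not part of Beam or of the PR.
public class InstantPrecisionSketch {

    // Build an Instant from its two stored components:
    // seconds since the epoch (long) and nanoseconds within that second (int).
    public static Instant withNanos(long epochSecond, int nanoOfSecond) {
        return Instant.ofEpochSecond(epochSecond, nanoOfSecond);
    }

    // Whole (365-day) years that fit in a signed 64-bit nanosecond count,
    // i.e. the range available if seconds and nanos shared a single long.
    public static long yearsCoveredBySignedNanos() {
        long seconds = Long.MAX_VALUE / 1_000_000_000L;
        return seconds / (365L * 24 * 60 * 60);
    }

    public static void main(String[] args) {
        Instant t = withNanos(1_541_720_015L, 123_456_789);

        System.out.println(t.getEpochSecond()); // 1541720015
        System.out.println(t.getNano());        // 123456789

        // Narrowing to milliseconds (the Joda-style precision) truncates
        // everything below the millisecond.
        System.out.println(t.toEpochMilli());   // 1541720015123

        System.out.println(yearsCoveredBySignedNanos()); // 292
    }
}
```

Because Instant keeps the nanosecond part in its own int, the long is used
only for seconds, so the nanosecond precision does not shrink the
representable range.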

Comments are very welcome to this PR.

-Rui

On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <[email protected]> wrote:

> As you said, this would be update incompatible across all streaming
> pipelines. At the very least this would be a big problem for Dataflow
> users, and I believe many Flink users as well. I'm not sure the benefit
> here justifies causing problems for so many users.
>
> Reuven
>
> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <[email protected]>
> wrote:
>
>> Yes, microseconds is a good compromise for covering a long enough
>> timespan that there's little reason it could be hit (even for
>> processing historical data).
>>
>> Regarding backwards compatibility, could we just change the internal
>> representation of Beam's element timestamps, possibly with new APIs to
>> access the finer granularity? (True, it may not be upgrade
>> compatible.)
>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <[email protected]> wrote:
>> >
>> > The main difference (though possibly theoretical) is when time runs
>> out. With 64 bits and nanosecond precision, we can only represent times
>> about 292 years in the future (or the past).
>> >
>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <[email protected]>
>> wrote:
>> >>
>> >> I like nanoseconds as extremely future-proof. What about speccing this
>> out in stages: (1) domain of values, (2) portable encoding that can represent
>> those values, (3) language-specific types to embed the values in?
>> >>
>> >> 1. If it is a nanosecond-precision absolute time, and we eventually
>> want to migrate event time timestamps to match, then we need values for
>> "end of global window" and "end of time". TBH I am not sure we need both of
>> these any more. We can either define a max on the nanosecond range or
>> create distinguished values.
>> >>
>> >> 2. For portability, presumably an order-preserving integer encoding of
>> nanoseconds since epoch with whatever tweaks to allow for representing the
>> end of time. It might be useful to find a way to allow multiple. Not super
>> useful at a particular version, but might have given us a migration path.
>> It would also allow experiments for performance.
>> >>
>> >> 3. We could probably find a way to keep user-facing API compatibility
>> here while increasing underlying precision at 1 and 2, but it is probably not
>> worth it. A new Java type IMO addresses the lossiness issue because a user
>> would have to explicitly request truncation to assign to a millis event
>> time timestamp.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <[email protected]> wrote:
>> >>>
>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as
>> well as for Beam timestamps in general?  The latter likely has a bunch of
>> downstream consequences for all runners.
>> >>>
>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <[email protected]>
>> wrote:
>> >>>>
>> >>>> +1 to more precision even to the nano level, probably via Reuven's
>> >>>> proposal of a different internal representation.
>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <[email protected]>
>> wrote:
>> >>>> >
>> >>>> > +1 to offering more granular timestamps in general. I think it
>> will be
>> >>>> > odd if setting the element timestamp from a row DATETIME field is
>> >>>> > lossy, so we should seriously consider upgrading that as well.
>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <[email protected]>
>> wrote:
>> >>>> > >
>> >>>> > > One related issue that came up before is that we (perhaps
>> unnecessarily) restrict the precision of timestamps in the Python SDK to
>> milliseconds because of legacy reasons related to the Java runner's use of
>> Joda time.  Perhaps Beam portability should natively use a more granular
>> timestamp unit.
>> >>>> > >
>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <[email protected]>
>> wrote:
>> >>>> > >>
>> >>>> > >> Thanks Reuven!
>> >>>> > >>
>> >>>> > >> I think Reuven gives the third option:
>> >>>> > >>
>> >>>> > >> Change the internal representation of the DATETIME field in Row. Still
>> keep the public ReadableDateTime getDateTime(String fieldName) API to be
>> compatible with existing code. And I think we could add one more API,
>> getDateTimeNanosecond. This option is different from option one because
>> option one actually maintains two implementations of time.
>> >>>> > >>
>> >>>> > >> -Rui
>> >>>> > >>
>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <[email protected]>
>> wrote:
>> >>>> > >>>
>> >>>> > >>> I would vote that we change the internal representation of Row
>> to something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>> >>>> > >>>
>> >>>> > >>> We should still keep accessor methods that return and take
>> Joda objects, as the rest of Beam still depends on Joda.
>> >>>> > >>>
>> >>>> > >>> Reuven
>> >>>> > >>>
>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <[email protected]>
>> wrote:
>> >>>> > >>>>
>> >>>> > >>>> Hi Community,
>> >>>> > >>>>
>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by
>> Joda's DateTime (see Row.java#L611 and Row.java#L169). Joda's DateTime is
>> limited to millisecond precision. That is precise enough to
>> represent event-time timestamps, but it is not enough for real "time"
>> data. For "time" type data, we probably need to support up to
>> nanosecond precision.
>> >>>> > >>>>
>> >>>> > >>>> Unfortunately, Joda decided to stay at millisecond
>> precision: https://github.com/JodaOrg/joda-time/issues/139.
>> >>>> > >>>>
>> >>>> > >>>> If we want to support nanosecond precision, we
>> have two options:
>> >>>> > >>>>
>> >>>> > >>>> Option one: use the current FieldType's metadata field, so
>> that we could set a marker in the metadata and Row could check the metadata
>> to decide what is stored in the DATETIME field: Joda's DateTime or an
>> implementation that supports nanoseconds.
>> >>>> > >>>>
>> >>>> > >>>> Option two: add another field (maybe called a TIMESTAMP
>> field?) with an implementation that supports higher-precision time.
>> >>>> > >>>>
>> >>>> > >>>> What do you think about the need for higher precision in the time
>> type, and which option do you prefer?
>> >>>> > >>>>
>> >>>> > >>>> -Rui
>>
>
