As you said, this would be update-incompatible across all streaming
pipelines. At the very least this would be a big problem for Dataflow
users, and I believe many Flink users as well. I'm not sure the benefit
here justifies causing problems for so many users.

Reuven

On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <rober...@google.com> wrote:

> Yes, microseconds is a good compromise: it covers a long enough
> timespan that there's little risk of running out of range (even when
> processing historical data).
>
> Regarding backwards compatibility, could we just change the internal
> representation of Beam's element timestamps, possibly with new APIs to
> access the finer granularity? (True, it may not be upgrade
> compatible.)
> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
> >
> > The main difference (though possibly theoretical) is when time runs out.
> With 64 bits and nanosecond precision, we can only represent times about
> 292 years in the future (or the past).
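> >
> > (As a sanity check on that figure, a back-of-the-envelope computation in
> > plain Java; the class name is just illustrative:)
> >
> >   public class NanosRange {
> >     public static void main(String[] args) {
> >       // A signed 64-bit nanosecond count covers +/- 2^63 ns around the epoch.
> >       double seconds = Math.pow(2, 63) / 1e9;         // ~9.22e9 seconds
> >       double years = seconds / (365.25 * 24 * 3600);  // ~292 years
> >       System.out.printf("+/- %.0f years around the epoch%n", years);
> >     }
> >   }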
> >
> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <k...@apache.org> wrote:
> >>
> >> I like nanoseconds as extremely future-proof. What about speccing this
> out in stages: (1) the domain of values, (2) a portable encoding that can
> represent those values, (3) language-specific types to embed the values in?
> >>
> >> 1. If it is a nanosecond-precision absolute time, and we eventually
> want to migrate event time timestamps to match, then we need values for
> "end of global window" and "end of time". TBH I am not sure we need both of
> these any more. We can either define a max on the nanosecond range or
> create distinguished values.
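> >>
> >> (One possible shape for those distinguished values, sketched in Java; the
> >> names are made up for illustration and are not an actual Beam API:)
> >>
> >>   public final class NanosTimestamps {
> >>     // Reserve the top of the signed 64-bit nanos-since-epoch range.
> >>     public static final long END_OF_TIME = Long.MAX_VALUE;
> >>     public static final long END_OF_GLOBAL_WINDOW = Long.MAX_VALUE - 1;
> >>     // Everything at or below this is an ordinary event-time timestamp.
> >>     public static final long MAX_EVENT_TIME = Long.MAX_VALUE - 2;
> >>
> >>     public static boolean isDistinguished(long nanos) {
> >>       return nanos > MAX_EVENT_TIME;
> >>     }
> >>
> >>     private NanosTimestamps() {}
> >>   }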
> >>
> >> 2. For portability, presumably an order-preserving integer encoding of
> nanoseconds since epoch, with whatever tweaks are needed to represent the
> end of time. It might be useful to find a way to allow multiple encodings:
> not super useful at any particular version, but it would give us a
> migration path and allow experiments for performance.
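> >>
> >> (For reference, one standard trick for such an encoding: flip the sign
> >> bit of the signed 64-bit value and write it big-endian, so that unsigned
> >> lexicographic byte order matches numeric order. A sketch, illustrative
> >> only:)
> >>
> >>   import java.nio.ByteBuffer;
> >>
> >>   public final class OrderPreservingNanos {
> >>     public static byte[] encode(long nanos) {
> >>       // XOR with Long.MIN_VALUE flips the sign bit, mapping the signed
> >>       // range monotonically onto the unsigned range.
> >>       return ByteBuffer.allocate(8).putLong(nanos ^ Long.MIN_VALUE).array();
> >>     }
> >>
> >>     public static long decode(byte[] bytes) {
> >>       return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
> >>     }
> >>
> >>     private OrderPreservingNanos() {}
> >>   }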
> >>
> >> 3. We could probably find a way to keep user-facing API compatibility
> here while increasing underlying precision at 1 and 2, but it's probably
> not worth it. A new Java type IMO addresses the lossiness issue because a
> user would have to explicitly request truncation to assign to a millis
> event time timestamp.
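> >>
> >> (A hypothetical sketch of such a type: the only way to get a millisecond
> >> value out is through an explicitly named truncating method, so the
> >> precision loss is always visible at the call site. All names invented:)
> >>
> >>   import org.joda.time.Instant;
> >>
> >>   public final class NanoTimestamp {
> >>     private final long epochNanos;
> >>
> >>     private NanoTimestamp(long epochNanos) { this.epochNanos = epochNanos; }
> >>
> >>     public static NanoTimestamp ofEpochNanos(long epochNanos) {
> >>       return new NanoTimestamp(epochNanos);
> >>     }
> >>
> >>     public long getEpochNanos() { return epochNanos; }
> >>
> >>     // Explicit, lossy conversion to a millisecond Joda timestamp.
> >>     public Instant truncateToJodaMillis() {
> >>       return new Instant(Math.floorDiv(epochNanos, 1_000_000L));
> >>     }
> >>   }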
> >>
> >> Kenn
> >>
> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <c...@google.com> wrote:
> >>>
> >>> Is the proposal to do this both for Beam Schema DATETIME fields and
> for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
> >>>
> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ieme...@gmail.com>
> wrote:
> >>>>
> >>>> +1 to more precision even to the nano level, probably via Reuven's
> >>>> proposal of a different internal representation.
> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>> >
> >>>> > +1 to offering more granular timestamps in general. I think it
> would be odd if setting the element timestamp from a row DATETIME field
> is lossy, so we should seriously consider upgrading that as well.
> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <c...@google.com> wrote:
> >>>> > >
> >>>> > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds, for legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> >>>> > >
> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ruw...@google.com>
> wrote:
> >>>> > >>
> >>>> > >> Thanks Reuven!
> >>>> > >>
> >>>> > >> I think Reuven gives a third option:
> >>>> > >>
> >>>> > >> Change the internal representation of the DATETIME field in Row,
> but keep the public ReadableDateTime getDateTime(String fieldName) API for
> compatibility with existing code. I think we could also add one more API,
> say getDateTimeNanosecond. This differs from option one because option one
> actually maintains two implementations of time.
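> >>>> > >>
> >>>> > >> (Roughly like this, assuming Row kept a single long of epoch
> >>>> > >> nanoseconds internally; a sketch only, not the real Row code:)
> >>>> > >>
> >>>> > >>   import org.joda.time.DateTime;
> >>>> > >>   import org.joda.time.DateTimeZone;
> >>>> > >>
> >>>> > >>   public class RowSketch {
> >>>> > >>     private final long epochNanos;  // new internal representation
> >>>> > >>
> >>>> > >>     public RowSketch(long epochNanos) { this.epochNanos = epochNanos; }
> >>>> > >>
> >>>> > >>     // Existing-style API: a Joda view, truncated to milliseconds.
> >>>> > >>     public DateTime getDateTime() {
> >>>> > >>       return new DateTime(Math.floorDiv(epochNanos, 1_000_000L),
> >>>> > >>           DateTimeZone.UTC);
> >>>> > >>     }
> >>>> > >>
> >>>> > >>     // New API: the full-precision value.
> >>>> > >>     public long getDateTimeNanosecond() {
> >>>> > >>       return epochNanos;
> >>>> > >>     }
> >>>> > >>   }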
> >>>> > >>
> >>>> > >> -Rui
> >>>> > >>
> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com>
> wrote:
> >>>> > >>>
> >>>> > >>> I would vote that we change the internal representation of Row
> to something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
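> >>>> > >>>
> >>>> > >>> (For concreteness: java.time.Instant models time as epoch seconds
> >>>> > >>> plus a nanosecond-of-second adjustment, so either route works. A
> >>>> > >>> minimal illustration:)
> >>>> > >>>
> >>>> > >>>   import java.time.Instant;
> >>>> > >>>
> >>>> > >>>   public class JavaTimePrecision {
> >>>> > >>>     public static void main(String[] args) {
> >>>> > >>>       // Option A: java.time.Instant keeps the nanos directly.
> >>>> > >>>       Instant t = Instant.ofEpochSecond(1_541_500_000L, 123_456_789L);
> >>>> > >>>       System.out.println(t.getNano());  // 123456789
> >>>> > >>>
> >>>> > >>>       // Option B: "simply store it as a number" of epoch nanos.
> >>>> > >>>       long epochNanos = 1_541_500_000L * 1_000_000_000L + 123_456_789L;
> >>>> > >>>       System.out.println(epochNanos);
> >>>> > >>>     }
> >>>> > >>>   }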
> >>>> > >>>
> >>>> > >>> We should still keep accessor methods that return and take Joda
> objects, as the rest of Beam still depends on Joda.
> >>>> > >>>
> >>>> > >>> Reuven
> >>>> > >>>
> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ruw...@google.com>
> wrote:
> >>>> > >>>>
> >>>> > >>>> Hi Community,
> >>>> > >>>>
> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented with Joda's
> DateTime (see Row.java#L611 and Row.java#L169). Joda's DateTime is limited
> to millisecond precision. That is good enough for event-time timestamps,
> but not for real "time" data, for which we probably need to support
> precision up to the nanosecond.
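> >>>> > >>>>
> >>>> > >>>> (To make the gap concrete: java.time can parse and hold all nine
> >>>> > >>>> fractional digits, while the most a Joda value can hold is the
> >>>> > >>>> first three. A small illustration:)
> >>>> > >>>>
> >>>> > >>>>   import java.time.Instant;
> >>>> > >>>>
> >>>> > >>>>   public class MillisCeiling {
> >>>> > >>>>     public static void main(String[] args) {
> >>>> > >>>>       Instant fine = Instant.parse("2018-11-06T00:00:00.123456789Z");
> >>>> > >>>>       System.out.println(fine.getNano());  // 123456789
> >>>> > >>>>
> >>>> > >>>>       // Any round trip through milliseconds drops the rest.
> >>>> > >>>>       org.joda.time.Instant coarse =
> >>>> > >>>>           new org.joda.time.Instant(fine.toEpochMilli());
> >>>> > >>>>       System.out.println(coarse.getMillis() % 1000);  // 123
> >>>> > >>>>     }
> >>>> > >>>>   }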
> >>>> > >>>>
> >>>> > >>>> Unfortunately, Joda decided to keep the precision of
> millisecond: https://github.com/JodaOrg/joda-time/issues/139.
> >>>> > >>>>
> >>>> > >>>> If we want to support the precision of nanosecond, we could
> have two options:
> >>>> > >>>>
> >>>> > >>>> Option one: utilize the current FieldType's metadata field, so
> that we could set something into the metadata and Row could check it to
> decide what is saved in the DATETIME field: Joda's DateTime or an
> implementation that supports nanoseconds.
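> >>>> > >>>>
> >>>> > >>>> (A hypothetical illustration of that branching, with an invented
> >>>> > >>>> metadata key and a plain map standing in for the real FieldType
> >>>> > >>>> metadata API:)
> >>>> > >>>>
> >>>> > >>>>   import java.util.Map;
> >>>> > >>>>
> >>>> > >>>>   public class DateTimeMetadataSketch {
> >>>> > >>>>     // Made-up key; not an actual Beam metadata key.
> >>>> > >>>>     static final String PRECISION_KEY = "datetime.precision";
> >>>> > >>>>
> >>>> > >>>>     // Returns a java.time.Instant for nanos fields, else a Joda
> >>>> > >>>>     // Instant for the legacy millisecond interpretation.
> >>>> > >>>>     static Object readDateTime(Map<String, String> metadata, long raw) {
> >>>> > >>>>       if ("nanos".equals(metadata.get(PRECISION_KEY))) {
> >>>> > >>>>         return java.time.Instant.ofEpochSecond(
> >>>> > >>>>             Math.floorDiv(raw, 1_000_000_000L),
> >>>> > >>>>             Math.floorMod(raw, 1_000_000_000L));
> >>>> > >>>>       }
> >>>> > >>>>       return new org.joda.time.Instant(raw);
> >>>> > >>>>     }
> >>>> > >>>>   }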
> >>>> > >>>>
> >>>> > >>>> Option two: add another field type (maybe called TIMESTAMP?),
> with an implementation that supports higher-precision time.
> >>>> > >>>>
> >>>> > >>>> What do you think about the need for a higher-precision time
> type, and which option do you prefer?
> >>>> > >>>>
> >>>> > >>>> -Rui
>
