https://github.com/apache/beam/pull/6991
I am using java.time.Instant as the internal representation to replace Joda time for the DateTime field in the PR. A java.time.Instant uses a *long* to store seconds since the epoch and an *int* to store the nanosecond-of-second. The full 64 bits are therefore available for seconds since the epoch, so no range is lost.

Comments on this PR are very welcome.

-Rui

On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <re...@google.com> wrote:

> As you said, this would be update incompatible across all streaming pipelines. At the very least this would be a big problem for Dataflow users, and I believe many Flink users as well. I'm not sure the benefit here justifies causing problems for so many users.
>
> Reuven
>
> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <rober...@google.com> wrote:
>
>> Yes, microseconds is a good compromise for covering a long enough timespan that there's little reason it could be hit (even for processing historical data).
>>
>> Regarding backwards compatibility, could we just change the internal representation of Beam's element timestamps, possibly with new APIs to access the finer granularity? (True, it may not be upgrade compatible.)
>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
>> >
>> > The main difference (though possibly theoretical) is when time runs out. With 64 bits and nanosecond precision, we can only represent times about 244 years in the future (or the past).
>> >
>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <k...@apache.org> wrote:
>> >>
>> >> I like nanoseconds as extremely future-proof. What about speccing this out in stages: (1) domain of values, (2) portable encoding that can represent those values, (3) language-specific types to embed the values in.
>> >>
>> >> 1. If it is a nanosecond-precision absolute time, and we eventually want to migrate event time timestamps to match, then we need values for "end of global window" and "end of time". TBH I am not sure we need both of these any more. We can either define a max on the nanosecond range or create distinguished values.
>> >>
>> >> 2. For portability, presumably an order-preserving integer encoding of nanoseconds since epoch with whatever tweaks to allow for representing the end of time. It might be useful to find a way to allow multiple. Not super useful at a particular version, but might have given us a migration path. It would also allow experiments for performance.
>> >>
>> >> 3. We could probably find a way to keep user-facing API compatibility here while increasing underlying precision at 1 and 2, but it is probably not worth it. A new Java type IMO addresses the lossiness issue because a user would have to explicitly request truncation to assign to a millis event time timestamp.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <c...@google.com> wrote:
>> >>>
>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as well as for Beam timestamps in general? The latter likely has a bunch of downstream consequences for all runners.
>> >>>
>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>> >>>>
>> >>>> +1 to more precision even to the nano level, probably via Reuven's proposal of a different internal representation.
>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>>> >
>> >>>> > +1 to offering more granular timestamps in general. I think it will be odd if setting the element timestamp from a row DATETIME field is lossy, so we should seriously consider upgrading that as well.
>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <c...@google.com> wrote:
>> >>>> > >
>> >>>> > > One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds because of legacy reasons related to the Java runner's use of Joda time. Perhaps Beam portability should natively use a more granular timestamp unit.
>> >>>> > >
>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ruw...@google.com> wrote:
>> >>>> > >>
>> >>>> > >> Thanks Reuven!
>> >>>> > >>
>> >>>> > >> I think Reuven gives the third option:
>> >>>> > >>
>> >>>> > >> Change the internal representation of the DATETIME field in Row. Still keep the public ReadableDateTime getDateTime(String fieldName) API to be compatible with existing code. And I think we could add one more API, getDataTimeNanosecond. This option is different from option one because option one actually maintains two implementations of time.
>> >>>> > >>
>> >>>> > >> -Rui
>> >>>> > >>
>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>> >>>> > >>>
>> >>>> > >>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
>> >>>> > >>>
>> >>>> > >>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
>> >>>> > >>>
>> >>>> > >>> Reuven
>> >>>> > >>>
>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ruw...@google.com> wrote:
>> >>>> > >>>>
>> >>>> > >>>> Hi Community,
>> >>>> > >>>>
>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's DateTime (see Row.java#L611 and Row.java#L169). Joda's DateTime is limited to millisecond precision. That is precise enough to represent event-time timestamps, but it is not enough for real "time" data. For "time" type data, we probably need to support precision up to the nanosecond.
>> >>>> > >>>>
>> >>>> > >>>> Unfortunately, Joda decided to keep millisecond precision: https://github.com/JodaOrg/joda-time/issues/139.
>> >>>> > >>>>
>> >>>> > >>>> If we want to support nanosecond precision, we could have two options:
>> >>>> > >>>>
>> >>>> > >>>> Option one: utilize the current FieldType's metadata field, such that we could set something in the metadata and Row could check the metadata to decide what is stored in the DATETIME field: Joda's DateTime or an implementation that supports nanoseconds.
>> >>>> > >>>>
>> >>>> > >>>> Option two: have another field (maybe called a TIMESTAMP field?) with an implementation that supports higher-precision time.
>> >>>> > >>>>
>> >>>> > >>>> What do you think about the need for higher precision for the time type, and which option is preferred?
>> >>>> > >>>>
>> >>>> > >>>> -Rui
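
For reference, a minimal sketch of the trade-offs discussed in this thread, using only java.time.Instant from the JDK. The class and method names (TimestampPrecisionSketch, truncateToMillis, toEpochNanos, yearsEitherSideOfEpoch) are hypothetical and exist only for illustration; none of them is taken from Beam or from the PR. It shows the seconds-plus-nanos split, the explicit truncation a millisecond-only type would need, and the range limit of storing nanoseconds in a single signed 64-bit number, which by this arithmetic comes out to roughly 292 years either side of the epoch, the same order as the figure quoted above.

```java
import java.time.Instant;

// Hypothetical illustration class; not part of Beam or of the PR.
final class TimestampPrecisionSketch {

    // Drops everything below the millisecond, as a millisecond-only type would.
    public static Instant truncateToMillis(Instant t) {
        return Instant.ofEpochMilli(t.toEpochMilli());
    }

    // Encodes an Instant as one signed 64-bit count of nanoseconds since the epoch.
    // Throws ArithmeticException when the value does not fit in a long.
    public static long toEpochNanos(Instant t) {
        return Math.addExact(
            Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L), t.getNano());
    }

    // Approximate number of years a signed 64-bit count of the given unit
    // can cover on either side of the epoch (365.25-day years).
    public static double yearsEitherSideOfEpoch(long unitsPerSecond) {
        double seconds = (double) Long.MAX_VALUE / unitsPerSecond;
        return seconds / (365.25 * 24 * 60 * 60);
    }

    public static void main(String[] args) {
        // A long for seconds since the epoch plus an int for nanosecond-of-second.
        Instant t = Instant.ofEpochSecond(1_541_480_460L, 123_456_789L);
        System.out.println("seconds=" + t.getEpochSecond() + " nanos=" + t.getNano());

        // Explicit, lossy conversion down to millisecond precision.
        System.out.println("truncated nanos=" + truncateToMillis(t).getNano());

        // Single-long encodings: nanoseconds versus microseconds.
        System.out.println("epoch nanos=" + toEpochNanos(t));
        System.out.println("years (nanos)=" + yearsEitherSideOfEpoch(1_000_000_000L));
        System.out.println("years (micros)=" + yearsEitherSideOfEpoch(1_000_000L));
    }
}
```

Keeping the truncation in its own method mirrors the point made in the thread that a user should have to request the loss of precision explicitly when assigning to a millisecond event-time timestamp.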