The timestamps flow both ways since:
* IO authors are responsible for saying what the watermark timestamp is, and stateful DoFns also allow users to set timers in the relative and processing time domains.
* Runner authors need to understand and merge these timestamps to compute the global watermark for a PCollection.
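
To illustrate the second point, here is a minimal sketch, assuming a simplified model in which the merged watermark is the minimum of every reported watermark and hold. The names `ReportedTimestamp`, `to_micros` and `merge_watermarks` are hypothetical and not part of any Beam API. The point is only that each value carries its unit, so the merge never compares raw integers of different scales.

```python
from dataclasses import dataclass

# Hypothetical unit table for this sketch; not a Beam API.
_MICROS_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}


@dataclass(frozen=True)
class ReportedTimestamp:
    """A timestamp as reported by an IO or a timer, tagged with its unit."""
    value: int
    unit: str

    def to_micros(self) -> int:
        # Normalize to one common unit before any comparison.
        return self.value * _MICROS_PER_UNIT[self.unit]


def merge_watermarks(reported):
    """Return the minimum of the reported watermarks, in microseconds.

    Assumption: the merged watermark is the minimum over all inputs.
    Real runners track more state than this.
    """
    reported = list(reported)
    if not reported:
        raise ValueError("need at least one reported watermark")
    return min(r.to_micros() for r in reported)


io_watermark = ReportedTimestamp(1_573_772_400_000, "ms")  # reported by an IO
timer_hold = ReportedTimestamp(1_573_772_399, "s")         # held back by a timer
merged = merge_watermarks([io_watermark, timer_hold])
```

Here `merged` is the timer hold expressed in microseconds, because it is the earlier of the two once both are normalized.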

On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <[email protected]> wrote:

> My two cents: we just need a proto representation for timestamps and
> durations that includes units. The underlying library can then determine
> what to do with it. We can then have a standard across Beam SDKs and
> Runners for how to interpret the proto. Using a raw int64 for timestamps
> and durations is confusing and *very, very* bug-prone (as we have seen in
> the past).
>
> I don't know if this is relevant, but does Apache Beam have any standards
> surrounding leap years or leap seconds? If we were to make our own timestamp
> format, would we have to worry about that? Or is the timestamp supplied to
> Beam a property of the underlying system giving Beam the timestamp? If it
> is, then there may be some interop problems between sources.
>
> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <[email protected]> wrote:
>
>> I do agree that Apache Beam can represent dates and times with arbitrary
>> precision and can do it many different ways.
>>
>> My argument has always been around whether we should restrict this range
>> to a common standard to increase interoperability with other systems. For
>> example, SQL database servers vary in the ranges they support:
>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>> mask)
>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>> * Microsoft SQL: January 1, 1753, through December 31, 9999 for
>> datetime[4] and January 1, 1 CE through December 31, 9999 CE for datetime2[5]
>>
>> The common case of the global window containing timestamps that are
>> before and after all of the supported ranges above means that our users
>> can't represent a global window within a database using its common data
>> types.
>>
>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>> 2: https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>> 4: https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>> 5: https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>
>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> just an idea on these related topics that have come up recently - it might
>>> help to realize that we don't actually need full arithmetic on
>>> timestamps (the Beam model IMHO doesn't need to know the exact
>>> difference between two events). What we actually need is a slightly
>>> simplified algebra. Given two timestamps T1 and T2 and a "duration" (a
>>> different type from timestamp), we need these operations (not 100% sure
>>> that this is exhaustive, but it seems to be):
>>>
>>> - is_preceding(T1, T2): bool
>>>
>>>   - important: !is_preceding(T1, T2) does NOT imply
>>>     is_preceding(T2, T1); !is_preceding(T1, T2) && !is_preceding(T2, T1)
>>>     would mean the events are _concurrent_
>>>
>>>   - this relation also has to be antisymmetric
>>>
>>>   - given this function we can construct a comparator, where multiple
>>>     distinct timestamps can be "equal" (or have no particular ordering,
>>>     which is a natural property of time)
>>>
>>> - min_timestamp_following(T1, duration): T2
>>>
>>>   - that would return a timestamp for which is_preceding(T1 + duration,
>>>     T2) would return true and no other timestamp X would exist for which
>>>     is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>
>>>   - actually, this function would serve as the definition of the
>>>     duration object
>>>
>>> If we can supply this algebra, it seems that we can use any
>>> representation of timestamps and intervals.
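
The two operations above can be sketched as follows. This is only an illustration under an assumption of mine, not Jan's proposal: a timestamp is modelled as a closed interval of integer ticks (`Stamp`, a hypothetical type), so that two overlapping stamps are concurrent. A duration is a plain integer number of ticks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamp:
    """Hypothetical timestamp known only to lie within [earliest, latest] ticks."""
    earliest: int
    latest: int


def is_preceding(t1: Stamp, t2: Stamp) -> bool:
    # t1 precedes t2 only if every tick t1 may denote is before every tick of t2.
    return t1.latest < t2.earliest


def is_concurrent(t1: Stamp, t2: Stamp) -> bool:
    # Neither stamp precedes the other, so no ordering can be claimed.
    return not is_preceding(t1, t2) and not is_preceding(t2, t1)


def min_timestamp_following(t1: Stamp, duration: int) -> Stamp:
    # Shift t1 by `duration` ticks and return the first tick strictly after it.
    # With integer ticks, no Stamp X can satisfy shifted < X < result.
    nxt = t1.latest + duration + 1
    return Stamp(nxt, nxt)


a, b, c = Stamp(0, 5), Stamp(3, 8), Stamp(6, 6)
following = min_timestamp_following(a, 10)
```

In this model `a` and `b` overlap and are therefore concurrent, while `a` precedes `c`. A comparator built on `is_preceding` would treat `a` and `b` as "equal" even though they are distinct values.
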
>>> It might (probably) even be
>>> possible to let users specify their own types for timestamps and durations,
>>> which could solve the issue of currently not being able to correctly
>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>> for timestamps that low - cosmic microwave background being one example
>>> :)). Specifying this algebra probably boils down to proposal (3)
>>> in Robert's thread [1].
>>>
>>> Just my 2 cents.
>>>
>>> Jan
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>>
>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>
>>> Thanks for bringing up this discussion @Luke.
>>>
>>> As @Kenn mentioned, in Beam we have defined constant values for the
>>> min/max/end of the global window. I noticed that
>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>
>>> I think that both RFC 3339 and Beam's current implementation are big
>>> enough to express common window definitions. But users really can
>>> define a window size that is outside the scope of RFC 3339.
>>> Conceptually, we should not limit the time range for windows (although I
>>> think the range of RFC 3339 is big enough in most cases).
>>>
>>> To make sure people know the background of the discussion, I hope
>>> you don't mind that I put the original conversation thread[1] here.
>>>
>>> Best,
>>> Jincheng
>>>
>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>
>>> Robert Bradshaw <[email protected]> wrote on Tue, Nov 12, 2019 at 4:09 PM:
>>>
>>>> I agree about it being a tagged union in the model (together with
>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>> it's also (as discussed elsewhere) a question of being able to find an
>>>> embedding into existing datetime libraries.
>>>> The real question here is
>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>> find value in being able to process events for the lifetime of the
>>>> universe (or, at least, recorded human history). Artificially limiting
>>>> in this way would seem surprising to me at least.
>>>>
>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> >
>>>> > The max timestamp, min timestamp, and end of the global window are
>>>> > all performance hacks in my view. Timestamps in Beam are really a
>>>> > tagged union:
>>>> >
>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some quantitative timestamp ...)
>>>> >
>>>> > with the ordering
>>>> >
>>>> >     min < actual_time(...) < end_of_global < max
>>>> >
>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>> > comparisons and arithmetic.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <[email protected]> wrote:
>>>> >>
>>>> >> While crites@ was investigating using protobuf to represent Apache
>>>> >> Beam timestamps within the TestStreamEvents, he found out that the
>>>> >> well known type google.protobuf.Timestamp doesn't support certain
>>>> >> timestamps we were using in our tests (specifically the max timestamp
>>>> >> that Apache Beam supports).
>>>> >>
>>>> >> This led me to investigate, and the well known type
>>>> >> google.protobuf.Timestamp supports dates/times from
>>>> >> 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, which is much
>>>> >> smaller than the timestamp range that Apache Beam currently supports,
>>>> >> -9223372036854775ms to 9223372036854775ms, which is about 292277BC to
>>>> >> 294247AD (it was difficult to find a time range that represented this).
>>>> >>
>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>> >> over those ~10000 years.
>>>> >> Google decided to limit their range to be
>>>> >> compatible with the RFC 3339[1] standard, which does simplify many
>>>> >> things since it guarantees that all RFC 3339 time parsing/manipulation
>>>> >> libraries are supported.
>>>> >>
>>>> >> Should we:
>>>> >> A) define our own timestamp/duration types to be able to represent
>>>> >> the full time range that Apache Beam can express?
>>>> >> B) limit the valid timestamps in Apache Beam to some standard such
>>>> >> as RFC 3339?
>>>> >>
>>>> >> This discussion is somewhat related to the efforts to support nano
>>>> >> timestamps[2].
>>>> >>
>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>> >> 2:
>>>> >> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>> >>
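
The range mismatch described in the original message can be checked with a few lines. This is a rough sketch: the Beam bounds are the millisecond values quoted above, the protobuf bounds are the documented second-level limits of google.protobuf.Timestamp, and `fits_in_proto_timestamp` is a hypothetical helper, not an existing Beam or protobuf function. The year estimate assumes an average Gregorian year of 31,556,952 seconds.

```python
from datetime import datetime, timezone

# Beam timestamp bounds as quoted in the thread, in milliseconds.
BEAM_MIN_MILLIS = -9223372036854775
BEAM_MAX_MILLIS = 9223372036854775

# google.protobuf.Timestamp bounds in whole seconds since the Unix epoch
# (0001-01-01T00:00:00Z and 9999-12-31T23:59:59Z).
PROTO_MIN_SECONDS = -62135596800
PROTO_MAX_SECONDS = 253402300799

# Average length of a Gregorian year (365.2425 days), in seconds.
AVG_GREGORIAN_YEAR_SECONDS = 31_556_952


def fits_in_proto_timestamp(millis: int) -> bool:
    """Hypothetical helper: is this millisecond timestamp within the proto range?"""
    seconds = millis // 1000  # floor division also handles negative values
    return PROTO_MIN_SECONDS <= seconds <= PROTO_MAX_SECONDS


# Rough calendar year of Beam's maximum timestamp.
approx_max_year = int(1970 + BEAM_MAX_MILLIS / 1000 / AVG_GREGORIAN_YEAR_SECONDS)

mail_date_millis = int(
    datetime(2019, 11, 14, tzinfo=timezone.utc).timestamp() * 1000
)
print(fits_in_proto_timestamp(mail_date_millis))  # an ordinary event time
print(fits_in_proto_timestamp(BEAM_MAX_MILLIS))   # Beam's max timestamp
print(approx_max_year)
```

An ordinary event time fits, while both Beam extremes fall outside the protobuf range, which is the failure crites@ ran into with TestStreamEvents.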
