Hi,

just an idea on these related topics that have been coming up these days: it might help to realize that we actually don't need full arithmetic on timestamps (the Beam model IMHO doesn't need to know the exact difference between two events). What we actually need is a slightly simplified algebra. Given two timestamps T1 and T2 and a "duration" (a different type from timestamp), we need the following operations (not 100% sure that this list is exhaustive, but it seems to be):

 - is_preceding(T1, T2): bool

   - important: !is_preceding(T1, T2) does NOT imply is_preceding(T2, T1); !is_preceding(T1, T2) && !is_preceding(T2, T1) would mean the events are _concurrent_

   - this relation also has to be antisymmetric

   - given this function we can construct a comparator under which multiple distinct timestamps can be "equal" (or have no particular ordering, which is a natural property of time)

 - min_timestamp_following(T1, duration): T2

   - that would return a timestamp T2 for which is_preceding(T1 + duration, T2) is true, such that no other timestamp X exists for which is_preceding(T1 + duration, X) && is_preceding(X, T2) is true

   - actually, this function would serve as the definition for the duration object
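
To make the two operations above a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not Beam API: the names Timestamp, Duration, TICK_MICROS, tick(), is_concurrent() and compare() are hypothetical, and the model (timestamps stored as microseconds but ordered only at millisecond "tick" resolution, so two instants in the same tick are concurrent) is just one possible representation that satisfies the algebra.

```python
from dataclasses import dataclass
from functools import cmp_to_key

# Hypothetical resolution: ordering is only defined at 1 ms granularity.
TICK_MICROS = 1000


@dataclass(frozen=True)
class Timestamp:
    micros: int  # Python ints are unbounded, so no Long.MIN_VALUE limit here

    def tick(self) -> int:
        # Floor division keeps the ordering consistent for negative values.
        return self.micros // TICK_MICROS


@dataclass(frozen=True)
class Duration:
    ticks: int  # a separate type from Timestamp, measured in whole ticks


def is_preceding(t1: Timestamp, t2: Timestamp) -> bool:
    # Strict order on ticks; both directions can be False at the same time.
    return t1.tick() < t2.tick()


def is_concurrent(t1: Timestamp, t2: Timestamp) -> bool:
    return not is_preceding(t1, t2) and not is_preceding(t2, t1)


def min_timestamp_following(t1: Timestamp, duration: Duration) -> Timestamp:
    # "T1 + duration" lands in tick t1.tick() + duration.ticks; the result is
    # the first instant of the next tick, so nothing fits strictly in between.
    return Timestamp((t1.tick() + duration.ticks + 1) * TICK_MICROS)


def compare(t1: Timestamp, t2: Timestamp) -> int:
    # Comparator built from is_preceding; concurrent timestamps compare as 0.
    if is_preceding(t1, t2):
        return -1
    if is_preceding(t2, t1):
        return 1
    return 0


a, b, c = Timestamp(1_000), Timestamp(1_999), Timestamp(2_000)
ordered = sorted([c, b, a], key=cmp_to_key(compare))
```

In this sketch a and b are distinct values that are concurrent (same tick), while both precede c. Nothing outside these functions looks at the integer representation, so a different timestamp type could be swapped in as long as it provides the same two operations.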

If we can supply this algebra, it seems that we can use any representation of timestamps and intervals. It might even be possible to let users specify their own types for timestamps and durations, which could solve the issue of currently not being able to correctly represent timestamps lower than Long.MIN_VALUE (although we can get data for timestamps that low - cosmic microwave background being one example :)). Specifying this algebra probably boils down to proposal (3) in Robert's thread [1].

Just my 2 cents.

Jan

[1] https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E

On 11/13/19 10:11 AM, jincheng sun wrote:
Thanks for bringing up this discussion @Luke.

As @Kenn mentioned, in Beam we have defined constant values for the min/max/end of the global window. I noticed that google.protobuf.Timestamp/Duration is only used in window definitions, such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.

I think that both RFC 3339 and Beam's current implementation are big enough to express common window definitions. But users can still define a window size that is outside the range of RFC 3339. Conceptually, we should not limit the time range for windows (although I think the range of RFC 3339 is big enough in most cases).

To make sure everyone knows the background of the discussion, I hope you don't mind that I put the original conversation thread [1] here.

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10041#discussion_r344380809

On Tue, Nov 12, 2019 at 4:09 PM Robert Bradshaw <[email protected]> wrote:

    I agree about it being a tagged union in the model (together with
    actual_time(...) - epsilon). It's not just a performance hack though,
    it's also (as discussed elsewhere) a question of being able to find an
    embedding into existing datetime libraries. The real question here is
    whether we should limit ourselves to just these 10000 years AD, or
    find value in being able to process events for the lifetime of the
    universe (or, at least, recorded human history). Artificially limiting
    in this way would seem surprising to me at least.

    On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <[email protected]> wrote:
    >
    > The max timestamp, min timestamp, and end of the global window are all performance hacks in my view. Timestamps in Beam are really a tagged union:
    >
    >     timestamp ::= min | max | end_of_global | actual_time(... some quantitative timestamp ...)
    >
    > with the ordering
    >
    >     min < actual_time(...) < end_of_global < max
    >
    > We chose arbitrary numbers so that we could do simple numeric comparisons and arithmetic.
    >
    > Kenn
    >
    > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <[email protected]> wrote:
    >>
    >> While crites@ was investigating using protobuf to represent Apache Beam timestamps within the TestStreamEvents, he found out that the well-known type google.protobuf.Timestamp doesn't support certain timestamps we were using in our tests (specifically the max timestamp that Apache Beam supports).
    >>
    >> This led me to investigate, and the well-known type google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, which is much smaller than the timestamp range that Apache Beam currently supports, -9223372036854775ms to 9223372036854775ms, which is about 292277BC to 294247AD (it was difficult to find a time range that represented this).
    >>
    >> Similarly, google.protobuf.Duration represents any time range over those ~10000 years. Google decided to limit their range to be compatible with the RFC 3339[1] standard, which does simplify many things since it guarantees that all RFC 3339 time parsing/manipulation libraries are supported.
    >>
    >> Should we:
    >> A) define our own timestamp/duration types to be able to represent the full time range that Apache Beam can express?
    >> B) limit the valid timestamps in Apache Beam to some standard such as RFC 3339?
    >>
    >> This discussion is somewhat related to the efforts to support nano timestamps[2].
    >>
    >> 1: https://tools.ietf.org/html/rfc3339
    >> 2: https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
