Hi,
just an idea on these related topics that have come up recently: it might
help to realize that we don't actually need full arithmetic on
timestamps (the Beam model IMHO doesn't need to know the exact
difference between two events). What we actually need is a slightly
simplified algebra. Given two timestamps T1 and T2 and a "duration" (a
different type from timestamp), we need the following operations (I'm
not 100% sure this list is exhaustive, but it seems to be):
- is_preceding(T1, T2): bool
   - important: !is_preceding(T1, T2) does NOT imply
     is_preceding(T2, T1); !is_preceding(T1, T2) && !is_preceding(T2, T1)
     would mean the events are _concurrent_
   - this relation also has to be antisymmetric (is_preceding(T1, T2)
     and is_preceding(T2, T1) must never both hold)
   - given this function we can construct a comparator under which
     multiple distinct timestamps can be "equal" (or have no particular
     ordering, which is a natural property of time)
- min_timestamp_following(T1, duration): T2
   - this would return a timestamp T2 for which is_preceding(T1 +
     duration, T2) is true and for which no other timestamp X exists
     such that is_preceding(T1 + duration, X) && is_preceding(X, T2)
   - actually, this function would serve as the definition of the
     duration object
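
A minimal sketch of the two operations in Python, just to make the idea
concrete. Everything here is an assumption for illustration: the
Timestamp and Duration classes, the encoding of a timestamp as an
uncertainty interval [lo, hi] of integer ticks (with lo <= hi), and the
helper names is_concurrent and compare are hypothetical and not part of
Beam.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical timestamp: an uncertainty interval [lo, hi] in integer ticks."""
    lo: int
    hi: int


@dataclass(frozen=True)
class Duration:
    """Hypothetical duration; deliberately a different type from Timestamp."""
    ticks: int


def is_preceding(t1: Timestamp, t2: Timestamp) -> bool:
    # t1 precedes t2 only if t1 has certainly ended before t2 can begin.
    return t1.hi < t2.lo


def is_concurrent(t1: Timestamp, t2: Timestamp) -> bool:
    # Neither precedes the other: the two have no particular ordering.
    return not is_preceding(t1, t2) and not is_preceding(t2, t1)


def compare(t1: Timestamp, t2: Timestamp) -> int:
    # Comparator built from is_preceding; distinct timestamps may compare as 0.
    if is_preceding(t1, t2):
        return -1
    if is_preceding(t2, t1):
        return 1
    return 0


def min_timestamp_following(t1: Timestamp, d: Duration) -> Timestamp:
    # "T1 + duration" is the interval shifted by d. The earliest timestamp it
    # precedes starts one tick after the shifted interval ends.
    start = t1.hi + d.ticks + 1
    return Timestamp(start, start)


a, b, c = Timestamp(0, 5), Timestamp(3, 8), Timestamp(10, 12)
print(is_preceding(a, c), is_concurrent(a, b), compare(a, b))  # True True 0
print(min_timestamp_following(a, Duration(4)))  # Timestamp(lo=10, hi=10)
```

Note that "equality" under this comparator is not transitive (two
timestamps can each be concurrent with a third while one precedes the
other), so it describes a partial order rather than a sort key.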
If we can supply this algebra, it seems that we can use any
representation of timestamps and intervals. It might even be possible
to let users specify their own types for timestamps and durations,
which could solve the current inability to correctly represent
timestamps lower than Long.MIN_VALUE (although we can get data with
timestamps that low, the cosmic microwave background being one
example :)). Specifying this algebra probably boils down to
proposal (3) in Robert's thread [1].
Just my 2 cents.
Jan
[1] https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
On 11/13/19 10:11 AM, jincheng sun wrote:
Thanks for bringing up this discussion @Luke.
As @Kenn mentioned, in Beam we have defined constant values for
the min/max/end of the global window. I noticed that
google.protobuf.Timestamp/Duration is only used in window definitions,
such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
I think that both RFC 3339 and Beam's current implementation are big
enough to express common window definitions. But users can in fact
define a window size that is outside the range of RFC 3339.
Conceptually, we should not limit the time range for windows (although I
think the range of RFC 3339 is big enough in most cases).
To make sure people know the background of this discussion, I hope
you don't mind that I put the original conversation thread [1] here.
Best,
Jincheng
[1] https://github.com/apache/beam/pull/10041#discussion_r344380809
On Tue, Nov 12, 2019 at 4:09 PM Robert Bradshaw <[email protected]> wrote:
I agree about it being a tagged union in the model (together with
actual_time(...) - epsilon). It's not just a performance hack though,
it's also (as discussed elsewhere) a question of being able to find an
embedding into existing datetime libraries. The real question here is
whether we should limit ourselves to just these 10000 years AD, or
find value in being able to process events for the lifetime of the
universe (or, at least, recorded human history). Artificially limiting
in this way would seem surprising to me at least.
On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <[email protected]> wrote:
>
> The max timestamp, min timestamp, and end of the global window are
> all performance hacks in my view. Timestamps in Beam are really a
> tagged union:
>
> timestamp ::= min | max | end_of_global | actual_time(... some quantitative timestamp ...)
>
> with the ordering
>
> min < actual_time(...) < end_of_global < max
>
> We chose arbitrary numbers so that we could do simple numeric
> comparisons and arithmetic.
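
The tagged union and its ordering can be sketched roughly as follows.
This is only an illustration under stated assumptions: the names Tag,
ModelTimestamp and actual_time are hypothetical, and the quantitative
value is an unbounded Python int rather than any type Beam uses.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tag(IntEnum):
    # Declaration order encodes: min < actual_time(...) < end_of_global < max
    MIN = 0
    ACTUAL_TIME = 1
    END_OF_GLOBAL = 2
    MAX = 3


@dataclass(frozen=True)
class ModelTimestamp:
    """Hypothetical tagged-union timestamp; value is used only for ACTUAL_TIME."""
    tag: Tag
    value: int = 0

    def _key(self):
        # Compare by tag first, then by the quantitative value within ACTUAL_TIME.
        return (int(self.tag), self.value if self.tag is Tag.ACTUAL_TIME else 0)

    def __lt__(self, other: "ModelTimestamp") -> bool:
        return self._key() < other._key()


def actual_time(value: int) -> ModelTimestamp:
    return ModelTimestamp(Tag.ACTUAL_TIME, value)


MIN = ModelTimestamp(Tag.MIN)
END_OF_GLOBAL = ModelTimestamp(Tag.END_OF_GLOBAL)
MAX = ModelTimestamp(Tag.MAX)

# No sentinel numbers are needed: any quantitative value sorts between
# MIN and END_OF_GLOBAL, however large or small it is.
ordered = sorted([MAX, actual_time(10**30), END_OF_GLOBAL, actual_time(-10**30), MIN])
print([t.tag.name for t in ordered])
```

With this shape the special values never collide with real times, at
the cost of a tag check in every comparison.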
>
> Kenn
>
> On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <[email protected]> wrote:
>>
>> While crites@ was investigating using protobuf to represent
>> Apache Beam timestamps within the TestStreamEvents, he found out
>> that the well-known type google.protobuf.Timestamp doesn't support
>> certain timestamps we were using in our tests (specifically the
>> max timestamp that Apache Beam supports).
>>
>> This led me to investigate, and the well-known type
>> google.protobuf.Timestamp supports dates/times from
>> 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, which is
>> much smaller than the timestamp range that Apache Beam currently
>> supports: -9223372036854775ms to 9223372036854775ms, which is about
>> 290309 BC to 294247 AD, i.e. roughly 292277 years on either side of
>> 1970 (it was difficult to find a time range that represented this).
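
A rough way to check that year range, as a sketch only: it assumes a
mean Gregorian year of 365.2425 days, an epoch of 1970-01-01, and
astronomical year numbering (which has a year 0, so year -N corresponds
to N+1 BC). The variable names are illustrative.

```python
import math

# Millisecond bounds as quoted above. Numerically, the magnitude is
# (2**63 - 1) // 1000, i.e. the largest signed 64-bit value divided by 1000.
MAX_MS = 9223372036854775
MIN_MS = -MAX_MS
assert MAX_MS == (2**63 - 1) // 1000

# Assumption: mean Gregorian year length, expressed in milliseconds.
MS_PER_YEAR = 365.2425 * 86_400 * 1000

span_years = MAX_MS / MS_PER_YEAR                         # distance from the epoch
latest_year = math.floor(1970 + MAX_MS / MS_PER_YEAR)     # calendar year of the max
earliest_year = math.floor(1970 + MIN_MS / MS_PER_YEAR)   # astronomical year of the min

print(round(span_years), earliest_year, latest_year)  # 292277 -290308 294247
```

By comparison, google.protobuf.Timestamp covers only years 1 through
9999.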
>>
>> Similarly, google.protobuf.Duration represents any time range over
>> those ~10000 years. Google decided to limit their range to be
>> compatible with the RFC 3339 [1] standard, which does simplify many
>> things since it guarantees that all RFC 3339 time
>> parsing/manipulation libraries are supported.
>>
>> Should we:
>> A) define our own timestamp/duration types to be able to
>> represent the full time range that Apache Beam can express?
>> B) limit the valid timestamps in Apache Beam to some standard
>> such as RFC 3339?
>>
>> This discussion is somewhat related to the efforts to support
>> nano timestamps [2].
>>
>> 1: https://tools.ietf.org/html/rfc3339
>> 2: https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E