The timestamps flow both ways since:
* IO authors are responsible for saying what the watermark timestamp is and
stateful DoFns also allow for users to set timers in relative and
processing time domains.
* Runner authors need to understand and merge these timestamps together to
compute what the global watermark is for a PCollection.
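
As a rough illustration of the merge step, a runner could take the minimum over the watermarks reported by each input. This is only a sketch: `merge_watermarks` is a hypothetical helper, not a Beam API, and it assumes watermarks are plain integers in milliseconds since the epoch.

```python
from typing import Iterable

# Largest timestamp mentioned in this thread, in milliseconds since the epoch.
# Used here as the result when there are no inputs (an assumption of this sketch).
MAX_TIMESTAMP_MS = 9223372036854775


def merge_watermarks(source_watermarks_ms: Iterable[int]) -> int:
    """Combine per-input watermarks by taking the minimum over all of them."""
    return min(source_watermarks_ms, default=MAX_TIMESTAMP_MS)


print(merge_watermarks([1_000, 250, 4_000]))
```

The slowest input holds the combined watermark back, which is why every IO has to report its watermark in a form the runner can compare.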

On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <[email protected]> wrote:

> My two cents are we just need a proto representation for timestamps and
> durations that includes units. The underlying library can then determine
> what to do with it. Then further, we can have a standard across Beam SDKs
> and Runners of how to interpret the proto. Using a raw int64 for timestamps
> and durations is confusing and *very very* bug prone (as we have seen in
> the past).
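
To make the "value plus units" idea concrete, here is a minimal sketch of what such a representation might look like on the SDK side. The `Unit` and `Timestamp` names are hypothetical and do not correspond to any existing Beam proto; the point is only that two values with different raw integers can be recognized as the same instant once the unit travels with the number.

```python
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    # Each member's value is the number of nanoseconds in one unit.
    MILLIS = 1_000_000
    MICROS = 1_000
    NANOS = 1


@dataclass(frozen=True)
class Timestamp:
    value: int  # count of `unit`s since the epoch
    unit: Unit

    def to_nanos(self) -> int:
        """Normalize to nanoseconds so values in different units are comparable."""
        return self.value * self.unit.value


a = Timestamp(1_573_772_100_000, Unit.MILLIS)
b = Timestamp(1_573_772_100_000_000, Unit.MICROS)
print(a.to_nanos() == b.to_nanos())
```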
>
> I don't know if this is relevant, but does Apache Beam have any standards
> surrounding leap years or seconds? If we were to make our own timestamp
> format, would we have to worry about that? Or is the timestamp supplied to
> Beam a property of the underlying system giving Beam the timestamp? If it
> is, then there may be some interop problems between sources.
>
> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <[email protected]> wrote:
>
>> I do agree that Apache Beam can represent dates and times with arbitrary
>> precision and can do it many different ways.
>>
>> My argument has always been around whether we should restrict this range
>> to a common standard to increase interoperability across other systems. For
>> example, SQL database servers have varying degrees as to what ranges they
>> support:
>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>> mask)
>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>> * Microsoft SQL: January 1, 1753, through December 31, 9999 for
>> datetime[4] and January 1, 1 CE through December 31, 9999 CE for datetime2[5]
>>
>> The common case of the global window containing timestamps that are
>> before and after all of these supported ranges above means that our users
>> can't represent a global window within a database using its common data
>> types.
>>
>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>> 2:
>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>> 4:
>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>> 5:
>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>
>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> just an idea on these related topics that appear these days - it might
>>> help to realize that we actually don't need full arithmetic on
>>> timestamps (the Beam model IMHO doesn't need to know the exact
>>> difference between two events). What we actually need is a slightly simplified
>>> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
>>> from timestamp), we need operations (not 100% sure that this is exhaustive,
>>> but seems to be):
>>>
>>>  - is_preceding(T1, T2): bool
>>>
>>>    - important: !is_preceding(T1, T2) does NOT imply
>>> is_preceding(T2, T1); !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>> mean the events are _concurrent_
>>>
>>>    - this relation also has to be antisymmetric
>>>
>>>    - given this function we can construct a comparator, where multiple
>>> distinct timestamps can be "equal" (or with no particular ordering, which
>>> is natural property of time)
>>>
>>>  - min_timestamp_following(T1, duration): T2
>>>
>>>    - that would return a timestamp for which is_preceding(T1 + duration,
>>> T2) would return true and no other timestamp X would exist for which
>>> is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>
>>>    - actually, this function would serve as the definition for the
>>> duration object
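
The two operations above can be sketched as follows. This is an illustration under stated assumptions, not a proposal for the actual API: timestamps are integers in microseconds, but ordering is only defined at millisecond granularity (`RESOLUTION` is a made-up constant), so distinct timestamps within the same millisecond come out as concurrent.

```python
# Assumption of this sketch: timestamps are ints in microseconds, and two
# timestamps are only ordered if they fall into different 1000-microsecond buckets.
RESOLUTION = 1_000


def is_preceding(t1: int, t2: int) -> bool:
    """True if t1 is strictly before t2 at the chosen resolution."""
    return t1 // RESOLUTION < t2 // RESOLUTION


def is_concurrent(t1: int, t2: int) -> bool:
    """Neither timestamp precedes the other."""
    return not is_preceding(t1, t2) and not is_preceding(t2, t1)


def min_timestamp_following(t1: int, duration: int) -> int:
    """Return a T2 that follows t1 + duration with no timestamp ordered in between."""
    return ((t1 + duration) // RESOLUTION + 1) * RESOLUTION


print(is_preceding(1_000, 2_000))            # different buckets
print(is_concurrent(1_000, 1_999))           # same bucket
print(min_timestamp_following(1_500, 700))   # start of the next bucket after 2_200
```

With this choice there is no X for which both is_preceding(T1 + duration, X) and is_preceding(X, T2) hold, because T2 sits in the bucket immediately after the one containing T1 + duration.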
>>>
>>> If we can supply this algebra, it seems that we can use any
>>> representation of timestamps and intervals. It might (probably) even be
>>> possible to let users specify their own types for timestamps and durations,
>>> which could solve the issue of currently not being able to correctly
>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>> for timestamps that low - cosmic microwave background being one example
>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>> in Robert's thread [1].
>>>
>>> Just my 2 cents.
>>>
>>> Jan
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>
>>> Thanks for bringing up this discussion @Luke.
>>>
>>> As @Kenn mentioned, in Beam we have defined constant values for the
>>> min/max/end of the global window. I noticed that
>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>
>>> I think that both RFC 3339 and Beam's current implementation are big
>>> enough to express common window definitions. But users can indeed
>>> define a window size that is outside the scope of RFC 3339.
>>> Conceptually, we should not limit the time range for windows (although I
>>> think the range of RFC 3339 is big enough in most cases).
>>>
>>> To make sure people know the background of the discussion, I hope
>>> you don't mind that I link the original conversation thread[1] here.
>>>
>>> Best,
>>> Jincheng
>>>
>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>
>>> On Tue, Nov 12, 2019 at 4:09 PM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> I agree about it being a tagged union in the model (together with
>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>> it's also (as discussed elsewhere) a question of being able to find an
>>>> embedding into existing datetime libraries. The real question here is
>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>> find value in being able to process events for the lifetime of the
>>>> universe (or, at least, recorded human history). Artificially limiting
>>>> in this way would seem surprising to me at least.
>>>>
>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>> >
>>>> > The max timestamp, min timestamp, and end of the global window are
>>>> all performance hacks in my view. Timestamps in beam are really a tagged
>>>> union:
>>>> >
>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>>> quantitative timestamp ...)
>>>> >
>>>> > with the ordering
>>>> >
>>>> >     min < actual_time(...) < end_of_global < max
>>>> >
>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>> comparisons and arithmetic.
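
A minimal sketch of the tagged union and its ordering, for readers who prefer code. The `Timestamp` class and its helpers are hypothetical and not how any Beam SDK implements this; note that `actual_time` here holds an unbounded Python int, so the special values stay outside the numeric range no matter how large the payload gets.

```python
from dataclasses import dataclass
from functools import total_ordering
from typing import Optional

# Rank of each variant in the ordering min < actual_time(...) < end_of_global < max.
_RANK = {"min": 0, "actual_time": 1, "end_of_global": 2, "max": 3}


@total_ordering
@dataclass(frozen=True)
class Timestamp:
    tag: str
    micros: Optional[int] = None  # only meaningful for tag == "actual_time"

    def _key(self):
        # Compare by variant first, then by the quantitative value.
        return (_RANK[self.tag], self.micros or 0)

    def __lt__(self, other: "Timestamp") -> bool:
        return self._key() < other._key()


MIN = Timestamp("min")
END_OF_GLOBAL = Timestamp("end_of_global")
MAX = Timestamp("max")


def actual_time(micros: int) -> Timestamp:
    return Timestamp("actual_time", micros)


print(MIN < actual_time(-10**30) < actual_time(10**30) < END_OF_GLOBAL < MAX)
```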
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <[email protected]> wrote:
>>>> >>
>>>> >> While crites@ was investigating using protobuf to represent Apache
>>>> Beam timestamps within the TestStreamEvents, he found out that the well
>>>> known type google.protobuf.Timestamp doesn't support certain timestamps we
>>>> were using in our tests (specifically the max timestamp that Apache Beam
>>>> supports).
>>>> >>
>>>> >> This led me to investigate and the well known type
>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>> range that Apache Beam currently supports, -9223372036854775ms to
>>>> 9223372036854775ms, which is about 290309 BC to 294247 AD (it was difficult to
>>>> find a time range that represented this).
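
The mismatch between the two ranges can be checked with a few lines of arithmetic. This sketch uses only the bounds quoted above; `fits_in_proto_timestamp` is a hypothetical helper, not part of protobuf or Beam.

```python
import calendar

# Bounds of google.protobuf.Timestamp in whole seconds since the Unix epoch:
# 0001-01-01T00:00:00Z and 9999-12-31T23:59:59Z.
PROTO_MIN_SECONDS = calendar.timegm((1, 1, 1, 0, 0, 0))
PROTO_MAX_SECONDS = calendar.timegm((9999, 12, 31, 23, 59, 59))

# Bounds of the Beam timestamp range quoted in this thread, in milliseconds.
BEAM_MIN_MILLIS = -9223372036854775
BEAM_MAX_MILLIS = 9223372036854775


def fits_in_proto_timestamp(millis: int) -> bool:
    """True if a millisecond timestamp falls inside the protobuf Timestamp range."""
    seconds = millis // 1000
    return PROTO_MIN_SECONDS <= seconds <= PROTO_MAX_SECONDS


print(fits_in_proto_timestamp(0))                # the epoch itself fits
print(fits_in_proto_timestamp(BEAM_MAX_MILLIS))  # Beam's max does not
print(fits_in_proto_timestamp(BEAM_MIN_MILLIS))  # neither does Beam's min
```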
>>>> >>
>>>> >> Similarly, google.protobuf.Duration represents any time range
>>>> over those ~10000 years. Google decided to limit their range to be
>>>> compatible with the RFC 3339[1] standard, which does simplify many things
>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>> are supported.
>>>> >>
>>>> >> Should we:
>>>> >> A) define our own timestamp/duration types to be able to represent
>>>> the full time range that Apache Beam can express?
>>>> >> B) limit the valid timestamps in Apache Beam to some standard such
>>>> as RFC 3339?
>>>> >>
>>>> >> This discussion is somewhat related to the efforts to support nano
>>>> timestamps[2].
>>>> >>
>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>> >> 2:
>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>
>>>