On Thu, Apr 18, 2019 at 12:23 AM Kenneth Knowles <k...@apache.org> wrote:
>
> For Robert's benefit, I want to point out that my proposal is to support 
> femtosecond data, with femtosecond-scale windows, even if watermarks/event 
> timestamps/holds are only millisecond precision.
>
> So the workaround once I have time, for SQL and schema-based transforms, will 
> be to have a logical type that matches the Java and protobuf definition of 
> nanos (seconds-since-epoch + nanos-in-second) to preserve the user's data. 
> And then when doing windowing inserting the necessary rounding somewhere in 
> the SQL or schema layers.

It seems to me that the underlying granularity of element timestamps
and window boundaries, as seen and operated on by the runner (and
transmitted over the FnAPI boundary), is not something we can make
invisible to the user (and consequently we cannot just insert rounding
on higher precision data and get the right results). However, I would
be very interested in seeing proposals that could get around this.
Watermarks, of course, can be as approximate (in one direction) as one
likes.
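To make the rounding concern concrete, here is a minimal sketch (plain Python, not the Beam API; the window size and timestamps are invented for illustration) of how sub-millisecond windows collapse when a millisecond-granularity runner rounds nanosecond user timestamps:

```python
# Sketch: fixed windows narrower than the runner's timestamp granularity.
# If user data carries nanosecond timestamps but the runner rounds to
# milliseconds, sub-millisecond windows silently merge.

WINDOW_NANOS = 250_000  # hypothetical 250-microsecond fixed windows


def window_of(ts_nanos):
    # Index of the fixed window containing this timestamp.
    return ts_nanos // WINDOW_NANOS


def floor_to_millis(ts_nanos):
    # What a millisecond-granularity runner would see.
    return (ts_nanos // 1_000_000) * 1_000_000


t1, t2 = 100_000, 400_000  # 0.1 ms and 0.4 ms after the epoch

# At full precision the two events fall into distinct windows...
assert window_of(t1) != window_of(t2)

# ...but after rounding to the runner's granularity they merge,
# changing the grouping the user observes.
assert window_of(floor_to_millis(t1)) == window_of(floor_to_millis(t2))
```

This is why rounding cannot simply be inserted on higher-precision data without the user noticing.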

As for choice of granularity, it would be ideal if any time-like field
could be used as the timestamp (for subsequent windowing). On the
other hand, nanoseconds (or smaller) complicate the arithmetic and the
encoding, as a 64-bit int has a time range of only a couple hundred
years without overflow (which is an argument for microseconds, as they
are a nice balance between sub-second granularity and multi-millennia
span). Standardizing on milliseconds is more restrictive but has the
advantage that it's what Java and Joda Time use now (though it's
always easier to pad precision than round it away).
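For reference, the overflow trade-off is easy to compute directly (back-of-the-envelope arithmetic, nothing Beam-specific):

```python
# Approximate span of a signed 64-bit integer timestamp at each granularity.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~31.6 million seconds


def span_in_years(units_per_second):
    return 2**63 / units_per_second / SECONDS_PER_YEAR


# Milliseconds: ~292 million years; microseconds: ~292 thousand years;
# nanoseconds: only ~292 years (overflow around the year 2262 when
# measured from the Unix epoch).
assert round(span_in_years(10**9)) == 292
assert 290_000 < span_in_years(10**6) < 295_000
assert 290_000_000 < span_in_years(10**3) < 295_000_000
```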

It would also be really nice to clean up the infinite-future being the
somewhat arbitrary max micros rounded to millis, and
end-of-global-window being infinite-future minus 1 hour (IIRC), etc.
as well as the ugly logic in Python to cope with millis-micros
conversion.
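As one concrete instance of that conversion pain (a generic sketch; the actual Beam constants may differ in detail), a maximum timestamp defined in one unit does not survive a round trip through the other:

```python
# A "max timestamp" defined as the largest signed 64-bit microsecond value...
MAX_MICROS = 2**63 - 1

# ...truncates when expressed in milliseconds...
max_millis = MAX_MICROS // 1000

# ...so converting back does not recover the original: the round trip
# silently drops 807 microseconds, and any code comparing the two
# representations for equality has to special-case the difference.
assert max_millis * 1000 != MAX_MICROS
assert MAX_MICROS - max_millis * 1000 == 807
```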

> On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <rob...@frantil.com> wrote:
>>
>> +1 for plan B. Nanosecond precision on windowing seems... a little much for 
>> a system that's aggregating data over time. Even for processing say particle 
>> super collider data, they'd get away with artificially increasing the 
>> granularity in batch settings.
>>
>> Now if they were streaming... they'd probably want femtoseconds anyway.
>> The point is, we should see if users demand it before adding in the 
>> necessary work.
>>
>> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath <chamik...@google.com> 
>> wrote:
>>>
>>> +1 for plan B as well. I think it's important to make timestamp precision 
>>> consistent now without introducing surprising behaviors for existing users. 
>>> But we should move towards higher-granularity timestamps in the 
>>> long run to support use cases that Beam users would otherwise miss 
>>> out on (with a runner that supports such precision).
>>>
>>> - Cham
>>>
>>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>> I also like Plan B because, in the cross-language case, the pipeline would 
>>>> simply not work unless every party (Runners & SDKs) was aware of the new 
>>>> beam:coder:windowed_value:v2 coder. Plan A has the property that if 
>>>> the SDK/Runner wasn't updated, it may start truncating the timestamps 
>>>> unexpectedly.
>>>>
>>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>> Kenn, this discussion is about the precision of the timestamp in the user 
>>>>> data. As you mentioned, Runners need not have the same granularity as the 
>>>>> user data, as long as they round timestamps correctly to guarantee that 
>>>>> triggers fire correctly. But the user data should have the same 
>>>>> precision across SDKs; otherwise user data timestamps will be truncated in 
>>>>> cross-language scenarios.
>>>>>
>>>>> Based on the systems that were listed, either microsecond or nanosecond 
>>>>> would make sense. The issue with changing the precision is that all Beam 
>>>>> runners except for possibly Beam Python on Dataflow are using millisecond 
>>>>> precision since they are all using the same Java Runner windowing/trigger 
>>>>> logic.
>>>>>
>>>>> Plan A: Swap precision to nanosecond
>>>>> 1) Change the Python SDK to only expose millisecond precision timestamps 
>>>>> (do now)
>>>>> 2) Change the user data encoding to support nanosecond precision (do now)
>>>>> 3) Swap runner libraries to be nanosecond precision aware updating all 
>>>>> window/triggering logic (do later)
>>>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>>>
>>>>> Plan B:
>>>>> 1) Change the Python SDK to only expose millisecond precision timestamps 
>>>>> and keep the data encoding as is (do now)
>>>>> (We could add greater precision later to plan B by creating a new version 
>>>>> beam:coder:windowed_value:v2 which would be nanosecond and would require 
>>>>> runners to correctly perform internal conversions for 
>>>>> windowing/triggering.)
>>>>>
>>>>> I think we should go with Plan B and when users request greater precision 
>>>>> we can make that an explicit effort. What do people think?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org> 
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>>>>
>>>>>> It would be nice to have a uniform precision for timestamps but, as Kenn
>>>>>> pointed out, timestamps are extracted from systems that have different
>>>>>> precision.
>>>>>>
>>>>>> To add to the list: Flink - milliseconds
>>>>>>
>>>>>> After all, it doesn't matter as long as there is sufficient precision
>>>>>> and conversions are done correctly.
>>>>>>
>>>>>> I think we could improve the situation by at least adding a
>>>>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>>
>>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>>>>> > I am not so sure this is a good idea. Here are some systems and their
>>>>>> > precision:
>>>>>> >
>>>>>> > Arrow - microseconds
>>>>>> > BigQuery - microseconds
>>>>>> > New Java instant - nanoseconds
>>>>>> > Firestore - microseconds
>>>>>> > Protobuf - nanoseconds
>>>>>> > Dataflow backend - microseconds
>>>>>> > Postgresql - microseconds
>>>>>> > Pubsub publish time - nanoseconds
>>>>>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>>>>>> > Cassandra - milliseconds
>>>>>> >
>>>>>> > IMO it is important to be able to treat any of these as a Beam
>>>>>> > timestamp, even though they aren't all streaming. Who knows when we
>>>>>> > might be ingesting a streamed changelog, or using them for reprocessing
>>>>>> > an archived stream. I think for this purpose we either should
>>>>>> > standardize on nanoseconds or make the runner's resolution independent
>>>>>> > of the data representation.
>>>>>> >
>>>>>> > I've had some offline conversations about this. I think we can have
>>>>>> > higher-than-runner precision in the user data, and allow WindowFns and
>>>>>> > DoFns to operate on this higher-than-runner precision data, and still
>>>>>> > have consistent watermark treatment. Watermarks are just bounds, after 
>>>>>> > all.
>>>>>> >
>>>>>> > Kenn
>>>>>> >
>>>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org
>>>>>> > <mailto:t...@apache.org>> wrote:
>>>>>> >
>>>>>> >     The Python SDK currently uses timestamps in microsecond resolution
>>>>>> >     while Java SDK, as most would probably expect, uses milliseconds.
>>>>>> >
>>>>>> >     This causes a few difficulties with portability (Python coders need
>>>>>> >     to convert to millis for WindowedValue and Timers, which is related
>>>>>> >     to a bug I'm looking into:
>>>>>> >
>>>>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>>>>> >
>>>>>> >     As Luke pointed out, the issue was previously discussed:
>>>>>> >
>>>>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>>>>> >
>>>>>> >     I'm not privy to the reasons why we decided to go with micros in
>>>>>> >     first place, but would it be too big of a change or impractical for
>>>>>> >     other reasons to switch Python SDK to millis before it gets more 
>>>>>> > users?
>>>>>> >
>>>>>> >     Thanks,
>>>>>> >     Thomas
>>>>>> >
