For Robert's benefit, I want to point out that my proposal is to support nanosecond data, with nanosecond-scale windows, even if watermarks/event timestamps/holds are only millisecond precision.
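
Roughly, the consistency argument is the following (a sketch with made-up
names, not a concrete API proposal): round the nanosecond window bound *up*
to the enclosing millisecond before comparing against the watermark, so a
millisecond watermark remains a correct, merely conservative, bound.

    NANOS_PER_MILLI = 1_000_000

    def window_bound_millis(window_end_nanos):
        # Ceiling division: round the nanosecond bound up to the next
        # millisecond so we never fire before the data is complete.
        return -(-window_end_nanos // NANOS_PER_MILLI)

    def ready_to_fire(window_end_nanos, watermark_millis):
        # The rounding can only delay firing, never make it early, so
        # results stay correct even though the watermark is coarser than
        # the window boundary.
        return watermark_millis >= window_bound_millis(window_end_nanos)
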
So the workaround once I have time, for SQL and schema-based transforms, will be to have a logical type that matches the Java and protobuf definition of nanos (seconds-since-epoch + nanos-in-second) to preserve the user's data, and then to insert the necessary rounding somewhere in the SQL or schema layers when windowing.

Kenn

On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <[email protected]> wrote:

> +1 for plan B. Nanosecond precision on windowing seems... a little much
> for a system that's aggregating data over time. Even for processing, say,
> particle supercollider data, they'd get away with artificially increasing
> the granularity in batch settings.
>
> Now if they were streaming... they'd probably want femtoseconds anyway.
> The point is, we should see if users demand it before adding in the
> necessary work.
>
> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath <[email protected]> wrote:
>
>> +1 for plan B as well. I think it's important to make timestamp precision
>> consistent now without introducing surprising behaviors for existing
>> users. But we should move towards a higher-granularity timestamp precision
>> in the long run to support use cases that Beam users might otherwise miss
>> out on (on a runner that supports such precision).
>>
>> - Cham
>>
>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <[email protected]> wrote:
>>
>>> I also like Plan B because in the cross-language case the pipeline would
>>> not work, since every party (runners & SDKs) would have to be aware of
>>> the new beam:coder:windowed_value:v2 coder. Plan A has the property that
>>> if the SDK/runner wasn't updated, it may start truncating the timestamps
>>> unexpectedly.
>>>
>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Kenn, this discussion is about the precision of the timestamp in the
>>>> user data. As you mentioned, runners need not have the same granularity
>>>> as the user data as long as they correctly round the timestamp to
>>>> guarantee that triggers are executed correctly, but the user data should
>>>> have the same precision across SDKs; otherwise user data timestamps will
>>>> be truncated in cross-language scenarios.
>>>>
>>>> Based on the systems that were listed, either microsecond or nanosecond
>>>> would make sense. The issue with changing the precision is that all Beam
>>>> runners, except possibly Beam Python on Dataflow, are using millisecond
>>>> precision, since they all use the same Java runner windowing/trigger
>>>> logic.
>>>>
>>>> Plan A: Swap precision to nanosecond
>>>> 1) Change the Python SDK to only expose millisecond precision
>>>> timestamps (do now)
>>>> 2) Change the user data encoding to support nanosecond precision (do
>>>> now)
>>>> 3) Swap runner libraries to be nanosecond-precision aware, updating all
>>>> window/triggering logic (do later)
>>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>>
>>>> Plan B:
>>>> 1) Change the Python SDK to only expose millisecond precision
>>>> timestamps and keep the data encoding as is (do now)
>>>> (We could add greater precision to Plan B later by creating a new coder
>>>> version, beam:coder:windowed_value:v2, which would be nanosecond
>>>> precision and would require runners to correctly perform internal
>>>> conversions for windowing/triggering.)
>>>>
>>>> I think we should go with Plan B, and when users request greater
>>>> precision we can make that an explicit effort. What do people think?
>>>>
>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>>>
>>>>> It would be nice to have a uniform precision for timestamps but, as
>>>>> Kenn pointed out, timestamps are extracted from systems that have
>>>>> different precision.
>>>>>
>>>>> To add to the list: Flink - milliseconds
>>>>>
>>>>> After all, it doesn't matter as long as there is sufficient precision
>>>>> and conversions are done correctly.
>>>>>
>>>>> I think we could improve the situation by at least adding a
>>>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>>>> > I am not so sure this is a good idea. Here are some systems and
>>>>> > their precision:
>>>>> >
>>>>> > Arrow - microseconds
>>>>> > BigQuery - microseconds
>>>>> > New Java Instant - nanoseconds
>>>>> > Firestore - microseconds
>>>>> > Protobuf - nanoseconds
>>>>> > Dataflow backend - microseconds
>>>>> > PostgreSQL - microseconds
>>>>> > Pubsub publish time - nanoseconds
>>>>> > MSSQL datetime2 - 100 nanoseconds (the original datetime, about 3
>>>>> > millis)
>>>>> > Cassandra - milliseconds
>>>>> >
>>>>> > IMO it is important to be able to treat any of these as a Beam
>>>>> > timestamp, even though they aren't all streaming. Who knows when we
>>>>> > might be ingesting a streamed changelog, or using them for
>>>>> > reprocessing an archived stream. I think for this purpose we should
>>>>> > either standardize on nanoseconds or make the runner's resolution
>>>>> > independent of the data representation.
>>>>> >
>>>>> > I've had some offline conversations about this. I think we can have
>>>>> > higher-than-runner precision in the user data, and allow WindowFns
>>>>> > and DoFns to operate on this higher-than-runner precision data, and
>>>>> > still have consistent watermark treatment. Watermarks are just
>>>>> > bounds, after all.
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <[email protected]
>>>>> > <mailto:[email protected]>> wrote:
>>>>> >
>>>>> >     The Python SDK currently uses timestamps in microsecond
>>>>> >     resolution, while the Java SDK, as most would probably expect,
>>>>> >     uses milliseconds.
>>>>> >
>>>>> >     This causes a few difficulties with portability (Python coders
>>>>> >     need to convert to millis for WindowedValue and Timers), which
>>>>> >     is related to a bug I'm looking into:
>>>>> >
>>>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>>>> >
>>>>> >     As Luke pointed out, the issue was previously discussed:
>>>>> >
>>>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>>>> >
>>>>> >     I'm not privy to the reasons why we decided to go with micros in
>>>>> >     the first place, but would it be too big of a change, or
>>>>> >     impractical for other reasons, to switch the Python SDK to
>>>>> >     millis before it gets more users?
>>>>> >
>>>>> >     Thanks,
>>>>> >     Thomas
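
For concreteness, the truncation Luke describes comes down to something like
this; an illustrative sketch only, not the actual beam:coder:windowed_value:v1
wire format:

    import struct

    NANOS_PER_MILLI = 1_000_000

    def encode_event_time(event_time_nanos):
        # v1-style idea: the timestamp travels as 8 big-endian bytes of
        # *milliseconds*, so the division silently drops finer precision.
        return struct.pack(">q", event_time_nanos // NANOS_PER_MILLI)

    def decode_event_time(encoded):
        # Round-tripping returns the timestamp rounded down to the
        # millisecond, regardless of the precision the SDK started with.
        (millis,) = struct.unpack(">q", encoded)
        return millis * NANOS_PER_MILLI

    ts = 1_555_555_555_123_456_789  # nanos since epoch
    assert decode_event_time(encode_event_time(ts)) == 1_555_555_555_123_000_000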

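And the "milliseconds" constructor Max suggests for the Python SDK's Timestamp
could be as small as this (a hypothetical sketch; of_millis and to_millis are
made-up names, not SDK code):

    class Timestamp(object):
        """Sketch of the Python SDK's microsecond-based Timestamp."""

        def __init__(self, seconds=0, micros=0):
            # The Python SDK stores timestamps at microsecond precision.
            self.micros = int(seconds) * 1_000_000 + int(micros)

        @classmethod
        def of_millis(cls, millis):
            # Exact conversion: 1 millisecond == 1,000 microseconds.
            return cls(micros=millis * 1000)

        def to_millis(self):
            # Floor division drops sub-millisecond precision -- the same
            # conversion the portability layer performs for WindowedValue.
            return self.micros // 1000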