+1 for plan B as well. I think it's important to make timestamp precision consistent now, without introducing surprising behaviors for existing users. But we should move towards higher-granularity timestamp precision in the long run, to support use cases that Beam users would otherwise miss out on (on a runner that supports such precision).

- Cham
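As a concrete illustration of the "surprising behaviors" above, here is a
minimal, self-contained Python sketch (not actual Beam code; the
encode/decode helpers are hypothetical stand-ins for a
millisecond-granularity encoding of WindowedValue) showing how microsecond
user timestamps would be silently truncated:

    MICROS_PER_MILLI = 1000

    def encode_timestamp_millis(timestamp_micros):
        # Floor division, i.e. truncation toward negative infinity.
        return timestamp_micros // MICROS_PER_MILLI

    def decode_timestamp_millis(timestamp_millis):
        return timestamp_millis * MICROS_PER_MILLI

    event_time = 1555555555123456  # microseconds since the epoch
    round_tripped = decode_timestamp_millis(
        encode_timestamp_millis(event_time))
    assert round_tripped == 1555555555123000  # the 456 micros are gone

Nothing fails loudly here: the pipeline keeps running, but any logic that
compares timestamps at microsecond precision now sees slightly earlier
event times.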
On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <[email protected]> wrote:

> I also like Plan B, because in the cross-language case the pipeline would
> not work, since every party (runners and SDKs) would have to be aware of
> the new beam:coder:windowed_value:v2 coder. Plan A has the property that
> if the SDK/runner isn't updated, it may start truncating the timestamps
> unexpectedly.
>
> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <[email protected]> wrote:
>
>> Kenn, this discussion is about the precision of the timestamp in the
>> user data. As you mentioned, runners need not have the same granularity
>> as the user data, as long as they round timestamps correctly to
>> guarantee that triggers are executed correctly. But the user data should
>> have the same precision across SDKs, otherwise user data timestamps will
>> be truncated in cross-language scenarios.
>>
>> Based on the systems that were listed, either microsecond or nanosecond
>> precision would make sense. The issue with changing the precision is
>> that all Beam runners, except possibly Beam Python on Dataflow, use
>> millisecond precision, since they all share the same Java runner
>> windowing/triggering logic.
>>
>> Plan A: swap precision to nanoseconds
>> 1) Change the Python SDK to only expose millisecond-precision timestamps
>>    (do now)
>> 2) Change the user data encoding to support nanosecond precision (do now)
>> 3) Update the runner libraries to be nanosecond-precision aware,
>>    updating all windowing/triggering logic (do later)
>> 4) Update the SDKs to expose nanosecond precision (do later)
>>
>> Plan B:
>> 1) Change the Python SDK to only expose millisecond-precision timestamps
>>    and keep the data encoding as is (do now)
>> (We could add greater precision to Plan B later by creating a new coder
>> version, beam:coder:windowed_value:v2, which would be nanosecond-based
>> and would require runners to correctly perform internal conversions for
>> windowing/triggering.)
>>
>> I think we should go with Plan B, and when users request greater
>> precision we can make that an explicit effort. What do people think?
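A rough sketch of what Plan B's first step could look like in Python (the
class below is a hypothetical stand-in, not the real
apache_beam.utils.timestamp.Timestamp): the SDK hands out only
millisecond-precision timestamps, while the wire encoding stays
microsecond-based, so every value the SDK can actually produce round-trips
losslessly.

    class Timestamp(object):
        """Event-time timestamp exposed at millisecond precision only."""

        def __init__(self, millis):
            self._millis = millis

        @classmethod
        def from_micros(cls, micros):
            # Rejecting (rather than silently truncating) sub-millisecond
            # input is one possible design choice; it makes precision loss
            # visible at the boundary instead of deep inside a coder.
            if micros % 1000 != 0:
                raise ValueError(
                    'sub-millisecond precision not supported: %d' % micros)
            return cls(micros // 1000)

        def to_micros(self):
            # The existing encoding keeps microseconds; greater precision
            # would come later via a new coder version
            # (beam:coder:windowed_value:v2), per the parenthetical above.
            return self._millis * 1000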
>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>
>>> It would be nice to have a uniform precision for timestamps, but, as
>>> Kenn pointed out, timestamps are extracted from systems that have
>>> different precision.
>>>
>>> To add to the list: Flink - milliseconds
>>>
>>> After all, it doesn't matter as long as there is sufficient precision
>>> and conversions are done correctly.
>>>
>>> I think we could improve the situation by at least adding a
>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>> > I am not so sure this is a good idea. Here are some systems and their
>>> > precision:
>>> >
>>> > Arrow - microseconds
>>> > BigQuery - microseconds
>>> > New Java Instant - nanoseconds
>>> > Firestore - microseconds
>>> > Protobuf - nanoseconds
>>> > Dataflow backend - microseconds
>>> > PostgreSQL - microseconds
>>> > Pubsub publish time - nanoseconds
>>> > MSSQL datetime2 - 100 nanoseconds (the original datetime type is
>>> >   about 3 milliseconds)
>>> > Cassandra - milliseconds
>>> >
>>> > IMO it is important to be able to treat any of these as a Beam
>>> > timestamp, even though they aren't all streaming. Who knows when we
>>> > might be ingesting a streamed changelog, or using them for
>>> > reprocessing an archived stream. I think for this purpose we should
>>> > either standardize on nanoseconds or make the runner's resolution
>>> > independent of the data representation.
>>> >
>>> > I've had some offline conversations about this. I think we can have
>>> > higher-than-runner precision in the user data, allow WindowFns and
>>> > DoFns to operate on this higher-than-runner-precision data, and still
>>> > have consistent watermark treatment. Watermarks are just bounds,
>>> > after all.
>>> >
>>> > Kenn
>>> >
>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <[email protected]> wrote:
>>> >
>>> >     The Python SDK currently uses timestamps in microsecond
>>> >     resolution, while the Java SDK, as most would probably expect,
>>> >     uses milliseconds.
>>> >
>>> >     This causes a few difficulties with portability (Python coders
>>> >     need to convert to millis for WindowedValue and Timers), which is
>>> >     related to a bug I'm looking into:
>>> >
>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>> >
>>> >     As Luke pointed out, the issue was previously discussed:
>>> >
>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>> >
>>> >     I'm not privy to the reasons why we decided to go with micros in
>>> >     the first place, but would it be too big of a change, or
>>> >     impractical for other reasons, to switch the Python SDK to millis
>>> >     before it gets more users?
>>> >
>>> >     Thanks,
>>> >     Thomas
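Kenn's "watermarks are just bounds" remark can be made concrete with a toy
example (plain Python, assuming integer nanosecond timestamps; none of this
is actual runner code): because integer flooring is monotone, a runner that
tracks time in milliseconds can consume nanosecond user timestamps without
ever misclassifying an on-time element as late.

    NANOS_PER_MILLI = 1000000

    def floor_to_millis(nanos):
        return nanos // NANOS_PER_MILLI

    def is_late(element_ts_nanos, watermark_nanos):
        # Flooring both sides preserves ordering: if t >= W at nanosecond
        # precision, then floor(t) >= floor(W) at millisecond precision,
        # so an on-time element can never appear late to the runner. Some
        # genuinely late elements may look on-time, which only errs on the
        # side of including data.
        return floor_to_millis(element_ts_nanos) < floor_to_millis(
            watermark_nanos)

This is also what lets Plan B defer the precision change: the runner's
internal resolution and the user-data resolution only need a correct,
order-preserving conversion between them, not equality.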
