For Robert's benefit, I want to point out that my proposal is to support nanosecond data, with nanosecond-scale windows, even if watermarks/event timestamps/holds are only millisecond precision.
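
Roughly, the consistency argument is the following (a sketch with made-up
names, not a concrete API proposal): round the nanosecond window bound *up*
to the enclosing millisecond before comparing against the watermark, so a
millisecond watermark remains a correct, merely conservative, bound.

    NANOS_PER_MILLI = 1_000_000

    def window_bound_millis(window_end_nanos):
        # Ceiling division: round the nanosecond bound up to the next
        # millisecond so we never fire before the data is complete.
        return -(-window_end_nanos // NANOS_PER_MILLI)

    def ready_to_fire(window_end_nanos, watermark_millis):
        # The rounding can only delay firing, never make it early, so
        # results stay correct even though the watermark is coarser than
        # the window boundary.
        return watermark_millis >= window_bound_millis(window_end_nanos)
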
So the workaround once I have time, for SQL and schema-based transforms, will be to have a logical type that matches the Java and protobuf definition of nanos (seconds-since-epoch + nanos-in-second) to preserve the user's data, and then to insert the necessary rounding somewhere in the SQL or schema layers when windowing.

Kenn

On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <[email protected]> wrote:

> +1 for plan B. Nanosecond precision on windowing seems... a little much
> for a system that's aggregating data over time. Even for processing, say,
> particle supercollider data, they'd get away with artificially increasing
> the granularity in batch settings.
>
> Now if they were streaming... they'd probably want femtoseconds anyway.
> The point is, we should see if users demand it before adding in the
> necessary work.
>
> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath <[email protected]> wrote:
>
>> +1 for plan B as well. I think it's important to make timestamp precision
>> consistent now without introducing surprising behaviors for existing
>> users. But we should move towards a higher-granularity timestamp precision
>> in the long run to support use cases that Beam users might otherwise miss
>> out on (on a runner that supports such precision).
>>
>> - Cham
>>
>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <[email protected]> wrote:
>>
>>> I also like Plan B because in the cross-language case the pipeline would
>>> not work, since every party (runners & SDKs) would have to be aware of
>>> the new beam:coder:windowed_value:v2 coder. Plan A has the property that
>>> if the SDK/runner wasn't updated, it may start truncating the timestamps
>>> unexpectedly.
>>>
>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Kenn, this discussion is about the precision of the timestamp in the
>>>> user data. As you mentioned, runners need not have the same granularity
>>>> as the user data as long as they correctly round the timestamp to
>>>> guarantee that triggers are executed correctly, but the user data should
>>>> have the same precision across SDKs; otherwise user data timestamps will
>>>> be truncated in cross-language scenarios.
>>>>
>>>> Based on the systems that were listed, either microsecond or nanosecond
>>>> would make sense. The issue with changing the precision is that all Beam
>>>> runners, except possibly Beam Python on Dataflow, are using millisecond
>>>> precision, since they all use the same Java runner windowing/trigger
>>>> logic.
>>>>
>>>> Plan A: Swap precision to nanosecond
>>>> 1) Change the Python SDK to only expose millisecond precision
>>>> timestamps (do now)
>>>> 2) Change the user data encoding to support nanosecond precision (do
>>>> now)
>>>> 3) Swap runner libraries to be nanosecond-precision aware, updating all
>>>> window/triggering logic (do later)
>>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>>
>>>> Plan B:
>>>> 1) Change the Python SDK to only expose millisecond precision
>>>> timestamps and keep the data encoding as is (do now)
>>>> (We could add greater precision to Plan B later by creating a new coder
>>>> version, beam:coder:windowed_value:v2, which would be nanosecond
>>>> precision and would require runners to correctly perform internal
>>>> conversions for windowing/triggering.)
>>>>
>>>> I think we should go with Plan B, and when users request greater
>>>> precision we can make that an explicit effort. What do people think?
>>>>
>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>>>
>>>>> It would be nice to have a uniform precision for timestamps but, as
>>>>> Kenn pointed out, timestamps are extracted from systems that have
>>>>> different precision.
>>>>>
>>>>> To add to the list: Flink - milliseconds
>>>>>
>>>>> After all, it doesn't matter as long as there is sufficient precision
>>>>> and conversions are done correctly.
>>>>>
>>>>> I think we could improve the situation by at least adding a
>>>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>>>> > I am not so sure this is a good idea. Here are some systems and
>>>>> > their precision:
>>>>> >
>>>>> > Arrow - microseconds
>>>>> > BigQuery - microseconds
>>>>> > New Java Instant - nanoseconds
>>>>> > Firestore - microseconds
>>>>> > Protobuf - nanoseconds
>>>>> > Dataflow backend - microseconds
>>>>> > PostgreSQL - microseconds
>>>>> > Pubsub publish time - nanoseconds
>>>>> > MSSQL datetime2 - 100 nanoseconds (the original datetime, about 3
>>>>> > millis)
>>>>> > Cassandra - milliseconds
>>>>> >
>>>>> > IMO it is important to be able to treat any of these as a Beam
>>>>> > timestamp, even though they aren't all streaming. Who knows when we
>>>>> > might be ingesting a streamed changelog, or using them for
>>>>> > reprocessing an archived stream. I think for this purpose we should
>>>>> > either standardize on nanoseconds or make the runner's resolution
>>>>> > independent of the data representation.
>>>>> >
>>>>> > I've had some offline conversations about this. I think we can have
>>>>> > higher-than-runner precision in the user data, and allow WindowFns
>>>>> > and DoFns to operate on this higher-than-runner precision data, and
>>>>> > still have consistent watermark treatment. Watermarks are just
>>>>> > bounds, after all.
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <[email protected]
>>>>> > <mailto:[email protected]>> wrote:
>>>>> >
>>>>> >     The Python SDK currently uses timestamps in microsecond
>>>>> >     resolution, while the Java SDK, as most would probably expect,
>>>>> >     uses milliseconds.
>>>>> >
>>>>> >     This causes a few difficulties with portability (Python coders
>>>>> >     need to convert to millis for WindowedValue and Timers), which
>>>>> >     is related to a bug I'm looking into:
>>>>> >
>>>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>>>> >
>>>>> >     As Luke pointed out, the issue was previously discussed:
>>>>> >
>>>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>>>> >
>>>>> >     I'm not privy to the reasons why we decided to go with micros in
>>>>> >     the first place, but would it be too big of a change, or
>>>>> >     impractical for other reasons, to switch the Python SDK to
>>>>> >     millis before it gets more users?
>>>>> >
>>>>> >     Thanks,
>>>>> >     Thomas
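
For concreteness, the truncation Luke describes comes down to something like
this; an illustrative sketch only, not the actual beam:coder:windowed_value:v1
wire format:

    import struct

    NANOS_PER_MILLI = 1_000_000

    def encode_event_time(event_time_nanos):
        # v1-style idea: the timestamp travels as 8 big-endian bytes of
        # *milliseconds*, so the division silently drops finer precision.
        return struct.pack(">q", event_time_nanos // NANOS_PER_MILLI)

    def decode_event_time(encoded):
        # Round-tripping returns the timestamp rounded down to the
        # millisecond, regardless of the precision the SDK started with.
        (millis,) = struct.unpack(">q", encoded)
        return millis * NANOS_PER_MILLI

    ts = 1_555_555_555_123_456_789  # nanos since epoch
    assert decode_event_time(encode_event_time(ts)) == 1_555_555_555_123_000_000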

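And the "milliseconds" constructor Max suggests for the Python SDK's Timestamp
could be as small as this (a hypothetical sketch; of_millis and to_millis are
made-up names, not SDK code):

    class Timestamp(object):
        """Sketch of the Python SDK's microsecond-based Timestamp."""

        def __init__(self, seconds=0, micros=0):
            # The Python SDK stores timestamps at microsecond precision.
            self.micros = int(seconds) * 1_000_000 + int(micros)

        @classmethod
        def of_millis(cls, millis):
            # Exact conversion: 1 millisecond == 1,000 microseconds.
            return cls(micros=millis * 1000)

        def to_millis(self):
            # Floor division drops sub-millisecond precision -- the same
            # conversion the portability layer performs for WindowedValue.
            return self.micros // 1000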