I am talking about the precision of timestamps in user data. I do not
believe plans A or B address what I am saying. As a user, I have a Kafka
stream of Avro records with a timestamp-micros field. I need to be able to:

 - tell Beam this is the event timestamp for the record
 - run my WindowFn against this field with its original precision

Consider a SQL query that reads SELECT ... FROM stream GROUP BY
TUMBLE(stream.micro_timestamp, INTERVAL 100 microseconds).

 - stream.micro_timestamp has to be made into the event timestamp
 - the time interval is small, so you must have the original data in order
   to correctly assign the window

(I think both of these can be safely rounded under the hood.)
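For concreteness, a minimal sketch of those two requirements in the Python
SDK (the micro_timestamp field name is hypothetical; the Python Timestamp
is micros-backed, which is exactly the precision at stake in this thread):

    import apache_beam as beam
    from apache_beam.transforms.window import TimestampedValue
    from apache_beam.utils.timestamp import Timestamp

    class AssignMicroEventTime(beam.DoFn):
        def process(self, record):
            # Declare the record's timestamp-micros field as its event time.
            micros = record['micro_timestamp']  # hypothetical Avro field
            # Timestamp stores micros internally, so nothing is lost at this
            # step; truncation happens later, e.g. when a coder writes the
            # WindowedValue in millis.
            yield TimestampedValue(record, Timestamp(micros=micros))

A WindowFn downstream of beam.ParDo(AssignMicroEventTime()) would then need
to see those micros to assign 100-microsecond windows correctly.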
This is not just a SQL issue; the new schema-based transforms will hit the
same issues. It is just easier to email SQL. The same idea applies in pure
Java, though there, forcing everything through Joda time makes users deal
with this themselves. But see https://github.com/apache/beam/pull/8289,
which I had to write to work around this.

Some timestamp values that are observable to users, and that we can
restrict more safely:

 - end time of a window
 - output of a timestamp combiner
 - firing time for an event time timer

Kenn

On Wed, Apr 17, 2019 at 2:26 PM Chamikara Jayalath <chamik...@google.com> wrote:

> +1 for plan B as well. I think it's important to make timestamp precision
> consistent now without introducing surprising behaviors for existing
> users. But we should move towards higher-granularity timestamps in the
> long run, to support use cases that Beam users might otherwise miss out
> on (with a runner that supports such precision).
>
> - Cham
>
> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> I also like Plan B: in the cross-language case the pipeline simply
>> would not run unless every party (runners & SDKs) were aware of the new
>> beam:coder:windowed_value:v2 coder, whereas Plan A has the property
>> that an SDK or runner that wasn't updated may start truncating
>> timestamps unexpectedly.
>>
>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Kenn, this discussion is about the precision of the timestamp in the
>>> user data. As you mentioned, runners need not share the granularity of
>>> the user data, as long as they correctly round timestamps to guarantee
>>> that triggers are executed correctly; but the user data should have
>>> the same precision across SDKs, otherwise user-data timestamps will be
>>> truncated in cross-language scenarios.
>>>
>>> Based on the systems that were listed, either microsecond or
>>> nanosecond precision would make sense. The issue with changing the
>>> precision is that all Beam runners, except possibly Beam Python on
>>> Dataflow, use millisecond precision, since they all share the same
>>> Java runner windowing/triggering logic.
>>>
>>> Plan A: swap precision to nanosecond
>>> 1) Change the Python SDK to only expose millisecond-precision
>>>    timestamps (do now)
>>> 2) Change the user data encoding to support nanosecond precision
>>>    (do now)
>>> 3) Swap runner libraries to be nanosecond-precision aware, updating
>>>    all windowing/triggering logic (do later)
>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>
>>> Plan B: keep the current encoding
>>> 1) Change the Python SDK to only expose millisecond-precision
>>>    timestamps and keep the data encoding as is (do now)
>>>
>>> (We could add greater precision to Plan B later by creating a new
>>> coder version, beam:coder:windowed_value:v2, which would be nanosecond
>>> and would require runners to correctly perform internal conversions
>>> for windowing/triggering.)
>>>
>>> I think we should go with Plan B, and when users request greater
>>> precision we can make that an explicit effort. What do people think?
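To make "correctly round" concrete, here is a rough sketch of the
direction-aware conversions a millisecond-grained runner could apply
internally to finer-grained user timestamps (hypothetical helpers, not
existing Beam code):

    MICROS_PER_MILLI = 1000

    def floor_to_millis(micros: int) -> int:
        # Watermarks are lower bounds, so rounding one down keeps it a
        # valid (merely looser) bound.
        return (micros // MICROS_PER_MILLI) * MICROS_PER_MILLI

    def ceil_to_millis(micros: int) -> int:
        # Event-time timer firings and window end times must not happen
        # early, so they round up to the next representable millisecond.
        return -((-micros) // MICROS_PER_MILLI) * MICROS_PER_MILLI

Rounding a watermark down only loosens the bound, and rounding a timer's
firing time up only delays it; neither breaks correctness, which is what
allows the runner's resolution to differ from the data's.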
>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>>
>>>> It would be nice to have a uniform precision for timestamps but, as
>>>> Kenn pointed out, timestamps are extracted from systems that have
>>>> different precisions.
>>>>
>>>> To add to the list: Flink - milliseconds
>>>>
>>>> After all, it doesn't matter as long as there is sufficient precision
>>>> and conversions are done correctly.
>>>>
>>>> I think we could improve the situation by at least adding a
>>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>>> > I am not so sure this is a good idea. Here are some systems and
>>>> > their precision:
>>>> >
>>>> > Arrow - microseconds
>>>> > BigQuery - microseconds
>>>> > New Java Instant - nanoseconds
>>>> > Firestore - microseconds
>>>> > Protobuf - nanoseconds
>>>> > Dataflow backend - microseconds
>>>> > PostgreSQL - microseconds
>>>> > Pubsub publish time - nanoseconds
>>>> > MSSQL datetime2 - 100 nanoseconds (the original datetime is about
>>>> > 3 milliseconds)
>>>> > Cassandra - milliseconds
>>>> >
>>>> > IMO it is important to be able to treat any of these as a Beam
>>>> > timestamp, even though they aren't all streaming. Who knows when we
>>>> > might be ingesting a streamed changelog, or using them to reprocess
>>>> > an archived stream. I think for this purpose we should either
>>>> > standardize on nanoseconds or make the runner's resolution
>>>> > independent of the data representation.
>>>> >
>>>> > I've had some offline conversations about this. I think we can have
>>>> > higher-than-runner precision in the user data, and allow WindowFns
>>>> > and DoFns to operate on this higher-precision data, and still have
>>>> > consistent watermark treatment. Watermarks are just bounds, after
>>>> > all.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org> wrote:
>>>> >
>>>> >     The Python SDK currently uses timestamps in microsecond
>>>> >     resolution, while the Java SDK, as most would probably expect,
>>>> >     uses milliseconds.
>>>> >
>>>> >     This causes a few difficulties with portability (Python coders
>>>> >     need to convert to millis for WindowedValue and Timers), which
>>>> >     is related to a bug I'm looking into:
>>>> >
>>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>>> >
>>>> >     As Luke pointed out, the issue was previously discussed:
>>>> >
>>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>>> >
>>>> >     I'm not privy to the reasons why we went with micros in the
>>>> >     first place, but would it be too big a change, or impractical
>>>> >     for other reasons, to switch the Python SDK to millis before it
>>>> >     gets more users?
>>>> >
>>>> >     Thanks,
>>>> >     Thomas
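Returning to the "milliseconds" constructor Max suggests above, a
hypothetical sketch against the Python SDK's existing micros-backed
Timestamp (not an API that exists in the SDK):

    from apache_beam.utils.timestamp import Timestamp

    def timestamp_from_millis(millis: int) -> Timestamp:
        # Hypothetical convenience constructor; since Timestamp stores
        # micros internally, widening millis to micros is lossless.
        return Timestamp(micros=millis * 1000)

    # e.g. timestamp_from_millis(1555459200000) == Timestamp(seconds=1555459200)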