+1 for plan B as well. I think it's important to make timestamp precision consistent now, without introducing surprising behaviors for existing users. But we should move towards higher-granularity timestamp precision in the long run, to support use cases that Beam users would otherwise miss out on (on a runner that supports such precision).

- Cham
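As a concrete illustration of the "surprising behaviors" above, here is a
minimal, self-contained Python sketch (not actual Beam code; the
encode/decode helpers are hypothetical stand-ins for a
millisecond-granularity encoding of WindowedValue) showing how microsecond
user timestamps would be silently truncated:

    MICROS_PER_MILLI = 1000

    def encode_timestamp_millis(timestamp_micros):
        # Floor division, i.e. truncation toward negative infinity.
        return timestamp_micros // MICROS_PER_MILLI

    def decode_timestamp_millis(timestamp_millis):
        return timestamp_millis * MICROS_PER_MILLI

    event_time = 1555555555123456  # microseconds since the epoch
    round_tripped = decode_timestamp_millis(
        encode_timestamp_millis(event_time))
    assert round_tripped == 1555555555123000  # the 456 micros are gone

Nothing fails loudly here: the pipeline keeps running, but any logic that
compares timestamps at microsecond precision now sees slightly earlier
event times.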
On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <[email protected]> wrote:

> I also like Plan B, because in the cross-language case the pipeline would
> not work, since every party (runners and SDKs) would have to be aware of
> the new beam:coder:windowed_value:v2 coder. Plan A has the property that
> if the SDK/runner isn't updated, it may start truncating the timestamps
> unexpectedly.
>
> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <[email protected]> wrote:
>
>> Kenn, this discussion is about the precision of the timestamp in the
>> user data. As you mentioned, runners need not have the same granularity
>> as the user data, as long as they round timestamps correctly to
>> guarantee that triggers are executed correctly. But the user data should
>> have the same precision across SDKs, otherwise user data timestamps will
>> be truncated in cross-language scenarios.
>>
>> Based on the systems that were listed, either microsecond or nanosecond
>> precision would make sense. The issue with changing the precision is
>> that all Beam runners, except possibly Beam Python on Dataflow, use
>> millisecond precision, since they all share the same Java runner
>> windowing/triggering logic.
>>
>> Plan A: swap precision to nanoseconds
>> 1) Change the Python SDK to only expose millisecond-precision timestamps
>>    (do now)
>> 2) Change the user data encoding to support nanosecond precision (do now)
>> 3) Update the runner libraries to be nanosecond-precision aware,
>>    updating all windowing/triggering logic (do later)
>> 4) Update the SDKs to expose nanosecond precision (do later)
>>
>> Plan B:
>> 1) Change the Python SDK to only expose millisecond-precision timestamps
>>    and keep the data encoding as is (do now)
>> (We could add greater precision to Plan B later by creating a new coder
>> version, beam:coder:windowed_value:v2, which would be nanosecond-based
>> and would require runners to correctly perform internal conversions for
>> windowing/triggering.)
>>
>> I think we should go with Plan B, and when users request greater
>> precision we can make that an explicit effort. What do people think?
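A rough sketch of what Plan B's first step could look like in Python (the
class below is a hypothetical stand-in, not the real
apache_beam.utils.timestamp.Timestamp): the SDK hands out only
millisecond-precision timestamps, while the wire encoding stays
microsecond-based, so every value the SDK can actually produce round-trips
losslessly.

    class Timestamp(object):
        """Event-time timestamp exposed at millisecond precision only."""

        def __init__(self, millis):
            self._millis = millis

        @classmethod
        def from_micros(cls, micros):
            # Rejecting (rather than silently truncating) sub-millisecond
            # input is one possible design choice; it makes precision loss
            # visible at the boundary instead of deep inside a coder.
            if micros % 1000 != 0:
                raise ValueError(
                    'sub-millisecond precision not supported: %d' % micros)
            return cls(micros // 1000)

        def to_micros(self):
            # The existing encoding keeps microseconds; greater precision
            # would come later via a new coder version
            # (beam:coder:windowed_value:v2), per the parenthetical above.
            return self._millis * 1000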
>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>
>>> It would be nice to have a uniform precision for timestamps, but, as
>>> Kenn pointed out, timestamps are extracted from systems that have
>>> different precision.
>>>
>>> To add to the list: Flink - milliseconds
>>>
>>> After all, it doesn't matter as long as there is sufficient precision
>>> and conversions are done correctly.
>>>
>>> I think we could improve the situation by at least adding a
>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>> > I am not so sure this is a good idea. Here are some systems and their
>>> > precision:
>>> >
>>> > Arrow - microseconds
>>> > BigQuery - microseconds
>>> > New Java Instant - nanoseconds
>>> > Firestore - microseconds
>>> > Protobuf - nanoseconds
>>> > Dataflow backend - microseconds
>>> > PostgreSQL - microseconds
>>> > Pubsub publish time - nanoseconds
>>> > MSSQL datetime2 - 100 nanoseconds (the original datetime type is
>>> >   about 3 milliseconds)
>>> > Cassandra - milliseconds
>>> >
>>> > IMO it is important to be able to treat any of these as a Beam
>>> > timestamp, even though they aren't all streaming. Who knows when we
>>> > might be ingesting a streamed changelog, or using them for
>>> > reprocessing an archived stream. I think for this purpose we should
>>> > either standardize on nanoseconds or make the runner's resolution
>>> > independent of the data representation.
>>> >
>>> > I've had some offline conversations about this. I think we can have
>>> > higher-than-runner precision in the user data, allow WindowFns and
>>> > DoFns to operate on this higher-than-runner-precision data, and still
>>> > have consistent watermark treatment. Watermarks are just bounds,
>>> > after all.
>>> >
>>> > Kenn
>>> >
>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <[email protected]> wrote:
>>> >
>>> >     The Python SDK currently uses timestamps in microsecond
>>> >     resolution, while the Java SDK, as most would probably expect,
>>> >     uses milliseconds.
>>> >
>>> >     This causes a few difficulties with portability (Python coders
>>> >     need to convert to millis for WindowedValue and Timers), which is
>>> >     related to a bug I'm looking into:
>>> >
>>> >     https://issues.apache.org/jira/browse/BEAM-7035
>>> >
>>> >     As Luke pointed out, the issue was previously discussed:
>>> >
>>> >     https://issues.apache.org/jira/browse/BEAM-1524
>>> >
>>> >     I'm not privy to the reasons why we decided to go with micros in
>>> >     the first place, but would it be too big of a change, or
>>> >     impractical for other reasons, to switch the Python SDK to millis
>>> >     before it gets more users?
>>> >
>>> >     Thanks,
>>> >     Thomas
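Kenn's "watermarks are just bounds" remark can be made concrete with a toy
example (plain Python, assuming integer nanosecond timestamps; none of this
is actual runner code): because integer flooring is monotone, a runner that
tracks time in milliseconds can consume nanosecond user timestamps without
ever misclassifying an on-time element as late.

    NANOS_PER_MILLI = 1000000

    def floor_to_millis(nanos):
        return nanos // NANOS_PER_MILLI

    def is_late(element_ts_nanos, watermark_nanos):
        # Flooring both sides preserves ordering: if t >= W at nanosecond
        # precision, then floor(t) >= floor(W) at millisecond precision,
        # so an on-time element can never appear late to the runner. Some
        # genuinely late elements may look on-time, which only errs on the
        # side of including data.
        return floor_to_millis(element_ts_nanos) < floor_to_millis(
            watermark_nanos)

This is also what lets Plan B defer the precision change: the runner's
internal resolution and the user-data resolution only need a correct,
order-preserving conversion between them, not equality.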
