On Wed, Oct 30, 2019 at 2:00 AM Jan Lukavský <je...@seznam.cz> wrote:
>
> TL;DR - can we solve this by representing aggregations not as point-wise
> events in time, but as time ranges? Explanation below.
>
> Hi,
>
> this is pretty interesting from a theoretical point of view. The
> question generally seems to be - given two events, can I reliably order
> them? One event might be the end of one window and the other the
> start of another window. Causality is strictly required here
> (although these events in fact have the same limiting timestamp). On the
> other hand, when I have two (random) point-wise events (one with
> timestamp T1 and the other with timestamp T2), then the causality of
> these two depends on their distance in space. If these two events differ
> by a single nanosecond, then unless they originate very "close" to each
> other, there is a high probability that different observers will see them
> in different order (and hence no causality is possible and the order
> doesn't matter).
>
> That is to say - if I'm dealing with a single external event (someone
> tells me something happened at time T), then - around the boundaries
> of windows - it doesn't matter which window this event is placed into.
> There is no causal connection between the start of a window and the event.

I resolve this theoretical issue by working in a space with a single
time dimension and zero spatial dimensions. That is, the location of
an event is completely determined by a single coordinate in time, and
distance and causality are hence well defined for all possible
observers :). Realistically, there is both error and ambiguity in
assigning absolute timestamps to real-world events, but it's important
that this choice of ordering should be preserved (e.g. not compressed
away) by the system.

> The situation is different when we actually perform an aggregation on
> multiple elements (GBK) - then although we produce output at a certain
> event-time timestamp, this event is not "point-wise external", but a
> running aggregation over multiple (and possibly long-term) data. That,
> of course, causally precedes the opening of a new (tumbling) window.
>
> The question now is - could we extend WindowedValue so that it doesn't
> contain only a single point-wise event in time, but a time range in the
> case of aggregations? Then the example of Window.into -> GBK -> Window.into
> could actually behave correctly (respecting causality), and another
> benefit could be that the resulting timestamp of the aggregation
> would no longer have to be window.maxTimestamp - 1 (which is kind of
> strange).
>
> This might actually be equivalent to the "minus epsilon" approach, but
> maybe easier to understand. Moreover, this should probably be available
> to users, because one can easily run into the same problems when using a
> stateful ParDo to perform a GBK-like aggregation.
>
> Thoughts?

This is quite an interesting idea. In some sense, timestamps become
like interval windows, and window assignment is similar to the window
mapping fns that we use for side inputs. I still think the idea of a
timestamp for an element and a window for an element is needed (e.g.
one can have elements in a single window, especially the global window,
that have different timestamps), but this could be interesting to
explore. It could definitely get rid of the "minus epsilon" weirdness,
though I don't think it completely solves the granularity issues.
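
To make the range idea concrete, here is a rough Java sketch (all names
hypothetical) of an event time that is an interval rather than a point:

    import java.time.Instant;

    // Event time as a half-open interval [start, end) rather than a
    // single point. A point-wise external event degenerates to
    // start == end; the output of a GBK covers the whole window it
    // aggregated over.
    public final class IntervalTimestamp {
      final Instant start; // inclusive
      final Instant end;   // exclusive (for non-degenerate intervals)

      IntervalTimestamp(Instant start, Instant end) {
        this.start = start;
        this.end = end;
      }

      static IntervalTimestamp point(Instant t) {
        return new IntervalTimestamp(t, t);
      }

      static IntervalTimestamp aggregation(Instant windowStart, Instant windowEnd) {
        return new IntervalTimestamp(windowStart, windowEnd);
      }

      // An aggregation over [s, e) is at or before any event at time >= e,
      // with no need to materialize "e minus epsilon" anywhere.
      boolean isAtOrBefore(IntervalTimestamp other) {
        return !end.isAfter(other.start);
      }
    }

Under such an encoding, a GBK output over [start, end) naturally sorts
before a window opening at end, which is the causality described above.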

> On 10/30/19 1:05 AM, Robert Bradshaw wrote:
> > On Tue, Oct 29, 2019 at 4:20 PM Kenneth Knowles <k...@apache.org> wrote:
> >> Point (1) is compelling. Solutions to the "minus epsilon" seem a bit 
> >> complex. On the other hand, an opaque and abstract Timestamp type (in each 
> >> SDK) going forward seems like a Pretty Good Idea (tm). Would you really 
> >> have to go floating point? Could you just have a distinguished 
> >> representation for non-inclusive upper/lower bounds? These could be at the 
> >> same reduced resolution as timestamps in element metadata, since that is 
> >> all they are compared against.
> > If I were coming up with an abstract, opaque representation of
> > Timestamp (and Duration) for Beam, I would explicitly include the
> > "minus epsilon" concept. One could still do arithmetic with these.
> > This would make any conversion to standard datetime libraries lossy
> > though.
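> >
> > A rough sketch of what I have in mind (all names hypothetical):
> >
> >     import java.time.Instant;
> >
> >     // An opaque timestamp that can represent "t minus epsilon" exactly.
> >     // Arithmetic and comparison still work; any conversion to a standard
> >     // Instant is lossy precisely when minusEpsilon is set.
> >     public final class BeamTimestamp implements Comparable<BeamTimestamp> {
> >       private final Instant instant;
> >       private final boolean minusEpsilon; // "just before instant"
> >
> >       public BeamTimestamp(Instant instant, boolean minusEpsilon) {
> >         this.instant = instant;
> >         this.minusEpsilon = minusEpsilon;
> >       }
> >
> >       @Override
> >       public int compareTo(BeamTimestamp o) {
> >         int c = instant.compareTo(o.instant);
> >         // t minus epsilon sorts strictly before t itself.
> >         return c != 0 ? c : Boolean.compare(o.minusEpsilon, minusEpsilon);
> >       }
> >
> >       // The lossy part: a plain Instant cannot carry the epsilon.
> >       public Instant toInstant() {
> >         return instant;
> >       }
> >     }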
> >
> >> Point (2) is also good, though it seems like something that could be 
> >> cleverly engineered and/or we just provide one implementation and it is 
> >> easy to make your own for finer granularity, since a WindowFn separately 
> >> receives the Timestamp (here I'm pretending it is abstract and opaque and 
> >> likely approximate) and the original element with whatever precision the 
> >> original data included.
> > Yes, but I don't see how a generic WindowFn would reach into the
> > (arbitrary) element and pull out this original data. One of the
> > benefits of the Beam model is that the WindowFn does not have to
> > depend on the element type.
> >
> >> Point (3) the model/runner owns the timestamp metadata so I feel fine 
> >> about it being approximated as long as any original user data is still 
> >> present. I don't recall seeing a compelling case where the timestamp 
> >> metadata that the runner tracks and understands is required to be exactly 
> >> the same as a user value (assuming users understand this distinction, 
> >> which is another issue that I would separate from whether it is 
> >> technically feasible).
> > As we provide the ability to designate user data as the runner
> > timestamp against which to window, and promote the runner timestamp
> > back to user data (people are going to want to get DateTime or Instant
> > objects out of it), it seems tricky to explain to users that one or
> > both of these operations may be lossy (and, in addition, I don't think
> > there's a consistently safe direction to round).
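> >
> > A small example of the hazard, going from a nanos-precision user
> > timestamp to a millis-precision runner (toy code, not Beam's API):
> >
> >     import java.time.Instant;
> >
> >     public class RoundingHazard {
> >       public static void main(String[] args) {
> >         // A GBK output at the end of a [0s, 1s) window, at nanos precision.
> >         Instant gbkOutput = Instant.ofEpochSecond(0, 999_999_999);
> >
> >         // Rounding down stays inside the window, but collapses every
> >         // distinct timestamp within the same millisecond.
> >         Instant down = Instant.ofEpochMilli(gbkOutput.toEpochMilli());
> >
> >         // Rounding up moves the element into the *next* window.
> >         Instant up = Instant.ofEpochMilli(gbkOutput.toEpochMilli() + 1);
> >
> >         System.out.println(down + " vs " + up);
> >       }
> >     }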
> >
> >> The more I think about the very real problems you point out, the more I 
> >> think that our backwards-incompatible move should be to our own abstract 
> >> Timestamp type, putting the design decision behind a minimal interface. If 
> >> we see a concrete design for that data type, we might be inspired how to 
> >> support more possibilities.
> >>
> >> As for the rest of the speculation... moving to nanos immediately helps 
> >> users so I am now +1 on just doing it, or moving ahead with an abstract 
> >> data type under the assumption that it will basically be nanos under the 
> >> hood.
> > If the fact that it's stored as nanos under the hood leaks out (and I
> > have trouble seeing how it won't) I'd lean towards just using them
> > directly (e.g. Java Instant) rather than wrapping it.
> >
> >> Having a cleverly resolution-independent system is interesting and maybe 
> >> extremely future proof but maybe preparing for a very distant future that 
> >> may never come.
> >>
> >> Kenn
> >>
> >> On Fri, Oct 18, 2019 at 11:35 AM Robert Bradshaw <rober...@google.com> 
> >> wrote:
> >>> TL;DR: We should just settle on nanosecond precision ubiquitously for 
> >>> timestamp/windowing in Beam.
> >>>
> >>>
> >>> Re-visiting this discussion in light of cross-language transforms and 
> >>> runners, and trying to tighten up testing. I've spent some more time 
> >>> thinking about how we could make these operations granularity-agnostic, 
> >>> but just can't find a good solution. In particular, the sticking
> >>> points seem to be:
> >>>
> >>> (1) Windows are half-open intervals, and the timestamp associated with a 
> >>> window coming out of a GBK is (by default) as large as possible but must 
> >>> live in that window. (Otherwise WindowInto + GBK + WindowInto would have 
> >>> the unfortunate effect of moving aggregate values into subsequent windows, 
> >>> which is clearly not the intent.) In other words, the timestamp of a 
> >>> grouped value is basically End(Window) - epsilon. Unless we choose a 
> >>> representation able to encode "minus epsilon" we must agree on a 
> >>> granularity (see the toy example after point (3)).
> >>>
> >>> (2) Unless we want to have multiple variants of all our WindowFns (e.g. 
> >>> FixedWindowMillis, FixedWindowMicros, FixedWindowNanos) we must agree on 
> >>> a granularity with which to parameterize these well-known operations. 
> >>> There are cases (e.g. side input window mapping, merging) where these Fns 
> >>> may be used downstream in contexts other than where they are 
> >>> applied/defined.
> >>>
> >>> (3) Reification of the timestamp into user-visible data, and the other 
> >>> way around, require a choice of precision to expose to the user. This 
> >>> means that the timestamp is actual data, and truncating/rounding cannot 
> >>> be done implicitly. Also, round-tripping reification and application of
> >>> timestamps should hopefully be idempotent no matter the SDK.
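> >>>
> >>> To illustrate (1): once a granularity is fixed, epsilon is one unit of
> >>> that granularity, and different choices give different answers. A toy
> >>> example (not Beam code):
> >>>
> >>>     import java.time.Instant;
> >>>
> >>>     public class EndOfWindowMinusEpsilon {
> >>>       public static void main(String[] args) {
> >>>         Instant windowEnd = Instant.parse("2019-01-01T00:01:00Z");
> >>>
> >>>         // "End(Window) minus epsilon" only has a concrete value once a
> >>>         // granularity is picked: epsilon is one unit of it.
> >>>         Instant maxAtMillis = windowEnd.minusMillis(1);
> >>>         Instant maxAtNanos = windowEnd.minusNanos(1);
> >>>
> >>>         // The two disagree, so downstream consumers see different
> >>>         // (though same-window) timestamps depending on the choice.
> >>>         System.out.println(maxAtMillis + " vs " + maxAtNanos);
> >>>       }
> >>>     }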
> >>>
> >>> The closest I've come is possibly parameterizing the timestamp type, 
> >>> where encoding, decoding (including pulling the end out of a window?), 
> >>> comparison (against each other and a watermark), "minus epsilon", etc 
> >>> could be UDFs. Possibly we'd need the full set of arithmetic operations 
> >>> to implement FixedWindows on an unknown timestamp type. Reification would 
> >>> simply be disallowed (or return an opaque rather than SDK-native type) 
> >>> if the SDK did not know that timestamp type. The fact that one might need 
> >>> comparison between timestamps of different types, or (lossless) coercion 
> >>> from one type to another, means that timestamp types need to know about 
> >>> each other, or another entity needs to know about the full cross-product, 
> >>> unless there is a common base-type (at which point we might as well 
> >>> always choose that).
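> >>>
> >>> Roughly, that parameterization would push everything behind an
> >>> interface like the following (all names hypothetical):
> >>>
> >>>     // Sketch of a granularity-agnostic timestamp "type", where every
> >>>     // operation the runner needs is a UDF supplied along with the type.
> >>>     public interface TimestampType<T> {
> >>>       byte[] encode(T timestamp);
> >>>       T decode(byte[] bytes);
> >>>       int compare(T a, T b);         // against each other and watermarks
> >>>       T minusEpsilon(T timestamp);   // for end-of-window outputs
> >>>       T endOf(byte[] encodedWindow); // pull the end out of a window
> >>>       // ...plus enough arithmetic to implement e.g. FixedWindows.
> >>>     }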
> >>>
> >>> An intermediate solution is to settle on a floating (decimal) point 
> >>> representation, plus a "minus-epsilon" bit. It wouldn't quite solve the 
> >>> mapping through SDK-native types (which could require rounding or errors 
> >>> or a new opaque type, and few date libraries could faithfully expose the 
> >>> minus epsilon part). It might also be more expensive (compute and 
> >>> storage), and would not allow us to use the protobuf timestamp/duration 
> >>> fields (or any standard date/time libraries).
> >>>
> >>> Unless we can come up with a clean solution to the issues above shortly, 
> >>> I think we should fix a precision and move forward. If this makes sense 
> >>> to everyone, then we can start talking about the specific choice of 
> >>> precision and a migration path (possibly only for portability).
> >>>
> >>>
> >>> For reference, the manipulations we do on timestamps are:
> >>>
> >>> WindowInto: Timestamp -> Window
> >>> TimestampCombine: Window, [Timestamp] -> Timestamp
> >>>      End(Window)
> >>>      Min(Timestamps)
> >>>      Max(Timestamps)
> >>> PastEndOfWindow: Watermark, Window -> {True, False}
> >>>
> >>> [SideInput]WindowMappingFn: Window -> Window
> >>>      WindowInto(End(Window))
> >>>
> >>> GetTimestamp: Timestamp -> SDK Native Object
> >>> EmitAtTimestamp: SDK Native Object -> Timestamp
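> >>>
> >>> Or, spelled as one (hypothetical) Java-ish interface over an abstract
> >>> timestamp type T and window type W:
> >>>
> >>>     public interface TimestampOps<T, W> {
> >>>       W windowInto(T timestamp);
> >>>       T timestampCombine(W window, Iterable<T> timestamps); // End/Min/Max
> >>>       boolean pastEndOfWindow(T watermark, W window);
> >>>       W sideInputWindowMapping(W mainWindow); // WindowInto(End(Window))
> >>>       Object getTimestamp(T timestamp);       // timestamp -> SDK-native
> >>>       T emitAtTimestamp(Object sdkNative);    // SDK-native -> timestamp
> >>>     }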
> >>>
> >>> On Fri, May 10, 2019 at 1:33 PM Robert Bradshaw <rober...@google.com> 
> >>> wrote:
> >>>> On Thu, May 9, 2019 at 9:32 AM Kenneth Knowles <k...@apache.org> 
> >>>> wrote:
> >>>>
> >>>>> From: Robert Bradshaw <rober...@google.com>
> >>>>> Date: Wed, May 8, 2019 at 3:00 PM
> >>>>> To: dev
> >>>>>
> >>>>>> From: Kenneth Knowles <k...@apache.org>
> >>>>>> Date: Wed, May 8, 2019 at 6:50 PM
> >>>>>> To: dev
> >>>>>>
> >>>>>>>> The end-of-window, for firing, can be approximate, but it seems it
> >>>>>>>> should be exact for timestamp assignment of the result (and similarly
> >>>>>>>> with the other timestamp combiners).
> >>>>>>> I was thinking that the window itself should be stored as exact data, 
> >>>>>>> while just the firing itself is approximated, since it already is, 
> >>>>>>> because of watermarks and timers.
> >>>>>> I think this works where we can compare encoded windows, but some
> >>>>>> portable interpretation of windows is required for runner-side
> >>>>>> implementation of merging windows (for example).
> >>>>> But in this case, you've recognized the URN of the WindowFn anyhow, so 
> >>>>> you understand its windows. Remembering that IntervalWindow is just one 
> >>>>> choice, and that windows themselves are totally user-defined and that 
> >>>>> merging logic is completely arbitrary per WindowFn (we probably should 
> >>>>> have some restrictions, but see 
> >>>>> https://issues.apache.org/jira/browse/BEAM-654). So I file this use 
> >>>>> case in the "runner knows everything about the WindowFn and Window type 
> >>>>> and window encoding anyhow".
> >>>> Being able to merge common windows in the runner is just an
> >>>> optimization, but an important one (especially for bootstrapping
> >>>> SDKs). However, this is not just about runner to SDK, but SDK to SDK
> >>>> as well (where a user from one SDK may want to inspect the windows
> >>>> produced by another). Having MillisIntervalWindow,
> >>>> MicrosIntervalWindow, NanosIntervalWindow, etc. isn't a path that I
> >>>> think is worth going down.
> >>>>
> >>>> Yes, we need to solve the "extract the endpoint of an unknown encoded
> >>>> window" problem as well, possibly similar to what we do with length
> >>>> prefix coders, possibly a restriction on window encodings themselves.
> >>>>
> >>>>>> There may also be issues if windows (or timestamps) are assigned to a
> >>>>>> high precision in one SDK, then inspected/acted on in another SDK, and
> >>>>>> then passed back to the original SDK where the truncation would be
> >>>>>> visible.
> >>>>> This is pretty interesting and complex. But again, a window is just 
> >>>>> data. An SDK has to know how to deserialize it to operate on it. Unless 
> >>>>> we do actually standardize some aspects of it. I don't believe 
> >>>>> BoundedWindow encoding has a defined way to get the timestamp without 
> >>>>> decoding the window, does it? I thought we had basically defaulted to 
> >>>>> IntervalWindows everywhere. But I am not following that closely.
> >>>>>
> >>>>>>> You raise a good point that min/max timestamp combiners require 
> >>>>>>> actually understanding the higher-precision timestamp. I can think of 
> >>>>>>> a couple things to do. One is the old "standardize all 3 or 4 
> >>>>>>> precisions we need" and the other is that combiners other than EOW 
> >>>>>>> exist primarily to hold the watermark, and that hold does not require 
> >>>>>>> the original precision. Still, neither of these is that satisfying.
> >>>>>> In the current model, the output timestamp is user-visible.
> >>>>> But as long as the watermark hold is less, it is safe. It requires 
> >>>>> knowing the coarse-precision lower bound of the timestamps of the 
> >>>>> input. And there may be situations where you also want the coarse upper 
> >>>>> bound. But you do know that these are at most one millisecond apart 
> >>>>> (assuming the runner is in millis) so perhaps no storage overhead. But 
> >>>>> a lot of complexity and chances for off-by-ones. And this is pretty 
> >>>>> hand-wavy.
> >>>> Yeah. A different SDK may (implicitly or explicitly) ask for the
> >>>> timestamp of the (transitive) output of a GBK, for which an
> >>>> approximation (either way) is undesirable.
> >>>>
> >>>>>>>>> A correction: Java *now* uses nanoseconds [1]. It uses the same 
> >>>>>>>>> breakdown as proto (int64 seconds since epoch + int32 nanos within 
> >>>>>>>>> second). It has legacy classes that use milliseconds, and Joda 
> >>>>>>>>> itself now encourages moving back to Java's new Instant type. 
> >>>>>>>>> Nanoseconds should complicate the arithmetic only for the one 
> >>>>>>>>> person authoring the date library, which they have already done.
> >>>>>>>> The encoding and decoding need to be done in a language-consistent 
> >>>>>>>> way
> >>>>>>>> as well.
> >>>>>>> I honestly am not sure what you mean by "language-consistent" here.
> >>>>>> If we want to make reading and writing of timestamps, windows
> >>>>>> cross-language, we can't rely on language-specific libraries to do the
> >>>>>> encoding.
> >>>>>>
> >>>>>>>> Also, most date libraries don't have division, etc. operators, so
> >>>>>>>> we have to do that as well. Not that it should be *that* hard.
> >>>>>>> If the libraries dedicated to time handling haven't found it needful, 
> >>>>>>> is there a specific reason you raise this? We do some simple math to 
> >>>>>>> find the window things fall into; is that it?
> >>>>>> Yes. E.g.
> >>>>>>
> >>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/FixedWindows.java#L77
> >>>>>>
> >>>>>> would be a lot messier if there were no mapping date libraries to raw
> >>>>>> ints that we can do arithmetic on. Writing this with the (seconds,
> >>>>>> nanos) representation is painful. But I suppose we'd only have to do
> >>>>>> it once per SDK.
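> >>>>>>
> >>>>>> For reference, the core of that arithmetic on raw millis is just one
> >>>>>> modulo (a simplification of the linked code, ignoring
> >>>>>> negative-timestamp edge cases):
> >>>>>>
> >>>>>>     public class FixedWindowAssign {
> >>>>>>       public static void main(String[] args) {
> >>>>>>         long sizeMs = 60_000, offsetMs = 0, timestampMs = 123_456_789L;
> >>>>>>
> >>>>>>         // One modulo on raw ints finds the enclosing window...
> >>>>>>         long start = timestampMs - ((timestampMs - offsetMs) % sizeMs);
> >>>>>>         System.out.println("[" + start + ", " + (start + sizeMs) + ")");
> >>>>>>
> >>>>>>         // ...whereas a (seconds, nanos) pair needs a manual
> >>>>>>         // carry/borrow between the two fields for the same step.
> >>>>>>       }
> >>>>>>     }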
> >>>>>
> >>>>> Yea I think that arithmetic is not so bad. But this raises the issue of 
> >>>>> writing a *generic* WindowFn where its idea of timestamp granularity 
> >>>>> (the WindowFn owns the window type and encoding) may not match the user 
> >>>>> data coming in. So you need to apply the approximation function to 
> >>>>> provide type-correct input to the WindowFn. That's kind of exciting and 
> >>>>> weird and perhaps unsolvable, except by choosing a concrete granularity.
> >>>>>
> >>>>> Kenn
> >>>>>
> >>>>>>
> >>>>>>>>>> It would also be really nice to clean up the infinite-future being 
> >>>>>>>>>> the
> >>>>>>>>>> somewhat arbitrary max micros rounded to millis, and
> >>>>>>>>>> end-of-global-window being infinite-future minus 1 hour (IIRC), 
> >>>>>>>>>> etc.
> >>>>>>>>>> as well as the ugly logic in Python to cope with millis-micros
> >>>>>>>>>> conversion.
> >>>>>>>>> I actually don't have a problem with this. If you are trying to 
> >>>>>>>>> keep the representation compact, not add bytes on top of instants, 
> >>>>>>>>> then you just have to choose magic numbers, right?
> >>>>>>>> It's not about compactness, it's the (historically-derived?)
> >>>>>>>> arbitrariness of the numbers.
> >>>>>>> What I mean is that the only reason to fit them into an integer at 
> >>>>>>> all is compactness. Otherwise, you could use a proper disjoint union 
> >>>>>>> representing your intent directly, and all fiddling goes away, like 
> >>>>>>> `Timestamp ::= PosInf | NegInf | EndOfGlobalWindow | 
> >>>>>>> ActualTime(Instant)`. It costs a couple of bits.
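> >>>>>>>
> >>>>>>> In Java, that union might look like (a sketch; ordering and
> >>>>>>> encoding omitted):
> >>>>>>>
> >>>>>>>     import java.time.Instant;
> >>>>>>>
> >>>>>>>     // The special values are explicit cases rather than magic
> >>>>>>>     // numbers near the ends of the int64 range.
> >>>>>>>     public final class Timestamp {
> >>>>>>>       enum Kind { NEG_INF, ACTUAL_TIME, END_OF_GLOBAL_WINDOW, POS_INF }
> >>>>>>>
> >>>>>>>       final Kind kind;
> >>>>>>>       final Instant instant; // set only for ACTUAL_TIME
> >>>>>>>
> >>>>>>>       private Timestamp(Kind kind, Instant instant) {
> >>>>>>>         this.kind = kind;
> >>>>>>>         this.instant = instant;
> >>>>>>>       }
> >>>>>>>
> >>>>>>>       static final Timestamp NEG_INF = new Timestamp(Kind.NEG_INF, null);
> >>>>>>>       static final Timestamp END_OF_GLOBAL_WINDOW =
> >>>>>>>           new Timestamp(Kind.END_OF_GLOBAL_WINDOW, null);
> >>>>>>>       static final Timestamp POS_INF = new Timestamp(Kind.POS_INF, null);
> >>>>>>>
> >>>>>>>       static Timestamp actualTime(Instant t) {
> >>>>>>>         return new Timestamp(Kind.ACTUAL_TIME, t);
> >>>>>>>       }
> >>>>>>>     }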
> >>>>>> The other cost is not being able to use standard libraries to
> >>>>>> represent all of your timestamps.
> >>>>>>
> >>>>>>>> For example, the bounds are chosen to
> >>>>>>>> fit within 64-bit micros despite milliseconds being the "chosen"
> >>>>>>>> granularity, and care was taken that
> >>>>>>>>
> >>>>>>>>      WindowInto(Global) | GBK | WindowInto(Minute) | GBK
> >>>>>>>>
> >>>>>>>> works, but
> >>>>>>>>
> >>>>>>>>      WindowInto(Global) | GBK | WindowInto(Day) | GBK
> >>>>>>>>
> >>>>>>>> may produce elements with timestamps greater than MaxTimestamp.
> >>>>>>>>
> >>>>>>>>> Kenn
> >>>>>>>>>
> >>>>>>>>> [1] https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html
> >>>>>>>>>
> >>>>>>>>>>> On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <rob...@frantil.com> 
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> +1 for plan B. Nanosecond precision on windowing seems... a 
> >>>>>>>>>>>> little much for a system that's aggregating data over time. Even 
> >>>>>>>>>>>> for processing, say, particle supercollider data, they'd get away 
> >>>>>>>>>>>> with artificially increasing the granularity in batch settings.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now if they were streaming... they'd probably want femtoseconds 
> >>>>>>>>>>>> anyway.
> >>>>>>>>>>>> The point is, we should see if users demand it before adding in 
> >>>>>>>>>>>> the necessary work.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath 
> >>>>>>>>>>>> <chamik...@google.com> wrote:
> >>>>>>>>>>>>> +1 for plan B as well. I think it's important to make timestamp 
> >>>>>>>>>>>>> precision consistent now without introducing surprising 
> >>>>>>>>>>>>> behaviors for existing users. But we should move towards a 
> >>>>>>>>>>>>> higher granularity timestamp precision in the long run to 
> >>>>>>>>>>>>> support use-cases that Beam users might otherwise miss out on 
> >>>>>>>>>>>>> (on a runner that supports such precision).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Cham
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <lc...@google.com> 
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> I also like Plan B because in the cross language case, the 
> >>>>>>>>>>>>>> pipeline would not work since every party (Runners & SDKs) 
> >>>>>>>>>>>>>> would have to be aware of the new beam:coder:windowed_value:v2 
> >>>>>>>>>>>>>> coder. Plan A has the property where if the SDK/Runner wasn't 
> >>>>>>>>>>>>>> updated then it may start truncating the timestamps 
> >>>>>>>>>>>>>> unexpectedly.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> 
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> Kenn, this discussion is about the precision of the timestamp 
> >>>>>>>>>>>>>>> in the user data. As you had mentioned, Runners need not have 
> >>>>>>>>>>>>>>> the same granularity of user data as long as they correctly 
> >>>>>>>>>>>>>>> round the timestamp to guarantee that triggers are executed 
> >>>>>>>>>>>>>>> correctly, but the user data should have the same precision 
> >>>>>>>>>>>>>>> across SDKs; otherwise user data timestamps will be truncated 
> >>>>>>>>>>>>>>> in cross-language scenarios.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Based on the systems that were listed, either microsecond or 
> >>>>>>>>>>>>>>> nanosecond would make sense. The issue with changing the 
> >>>>>>>>>>>>>>> precision is that all Beam runners except for possibly Beam 
> >>>>>>>>>>>>>>> Python on Dataflow are using millisecond precision since they 
> >>>>>>>>>>>>>>> are all using the same Java Runner windowing/trigger logic.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Plan A: Swap precision to nanosecond
> >>>>>>>>>>>>>>> 1) Change the Python SDK to only expose millisecond precision 
> >>>>>>>>>>>>>>> timestamps (do now)
> >>>>>>>>>>>>>>> 2) Change the user data encoding to support nanosecond 
> >>>>>>>>>>>>>>> precision (do now)
> >>>>>>>>>>>>>>> 3) Swap runner libraries to be nanosecond precision aware 
> >>>>>>>>>>>>>>> updating all window/triggering logic (do later)
> >>>>>>>>>>>>>>> 4) Swap SDKs to expose nanosecond precision (do later)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Plan B:
> >>>>>>>>>>>>>>> 1) Change the Python SDK to only expose millisecond precision 
> >>>>>>>>>>>>>>> timestamps and keep the data encoding as is (do now)
> >>>>>>>>>>>>>>> (We could add greater precision later to plan B by creating a 
> >>>>>>>>>>>>>>> new version beam:coder:windowed_value:v2 which would be 
> >>>>>>>>>>>>>>> nanosecond and would require runners to correctly perform 
> >>>>>>>>>>>>>>> internal conversions for windowing/triggering.)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think we should go with Plan B and when users request 
> >>>>>>>>>>>>>>> greater precision we can make that an explicit effort. What 
> >>>>>>>>>>>>>>> do people think?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels 
> >>>>>>>>>>>>>>> <m...@apache.org> wrote:
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for taking care of this issue in the Python SDK, 
> >>>>>>>>>>>>>>>> Thomas!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It would be nice to have a uniform precision for timestamps 
> >>>>>>>>>>>>>>>> but, as Kenn
> >>>>>>>>>>>>>>>> pointed out, timestamps are extracted from systems that have 
> >>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>> precision.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To add to the list: Flink - milliseconds
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> After all, it doesn't matter as long as there is sufficient 
> >>>>>>>>>>>>>>>> precision
> >>>>>>>>>>>>>>>> and conversions are done correctly.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think we could improve the situation by at least adding a
> >>>>>>>>>>>>>>>> "milliseconds" constructor to the Python SDK's Timestamp.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Max
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
> >>>>>>>>>>>>>>>>> I am not so sure this is a good idea. Here are some systems 
> >>>>>>>>>>>>>>>>> and their
> >>>>>>>>>>>>>>>>> precision:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Arrow - microseconds
> >>>>>>>>>>>>>>>>> BigQuery - microseconds
> >>>>>>>>>>>>>>>>> New Java instant - nanoseconds
> >>>>>>>>>>>>>>>>> Firestore - microseconds
> >>>>>>>>>>>>>>>>> Protobuf - nanoseconds
> >>>>>>>>>>>>>>>>> Dataflow backend - microseconds
> >>>>>>>>>>>>>>>>> Postgresql - microseconds
> >>>>>>>>>>>>>>>>> Pubsub publish time - nanoseconds
> >>>>>>>>>>>>>>>>> MSSQL datetime2 - 100 nanoseconds (original datetime about 
> >>>>>>>>>>>>>>>>> 3 millis)
> >>>>>>>>>>>>>>>>> Cassandra - milliseconds
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> IMO it is important to be able to treat any of these as a 
> >>>>>>>>>>>>>>>>> Beam
> >>>>>>>>>>>>>>>>> timestamp, even though they aren't all streaming. Who knows 
> >>>>>>>>>>>>>>>>> when we
> >>>>>>>>>>>>>>>>> might be ingesting a streamed changelog, or using them for 
> >>>>>>>>>>>>>>>>> reprocessing
> >>>>>>>>>>>>>>>>> an archived stream. I think for this purpose we either 
> >>>>>>>>>>>>>>>>> should
> >>>>>>>>>>>>>>>>> standardize on nanoseconds or make the runner's resolution 
> >>>>>>>>>>>>>>>>> independent
> >>>>>>>>>>>>>>>>> of the data representation.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I've had some offline conversations about this. I think we 
> >>>>>>>>>>>>>>>>> can have
> >>>>>>>>>>>>>>>>> higher-than-runner precision in the user data, and allow 
> >>>>>>>>>>>>>>>>> WindowFns and
> >>>>>>>>>>>>>>>>> DoFns to operate on this higher-than-runner precision data, 
> >>>>>>>>>>>>>>>>> and still
> >>>>>>>>>>>>>>>>> have consistent watermark treatment. Watermarks are just 
> >>>>>>>>>>>>>>>>> bounds, after all.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Kenn
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise 
> >>>>>>>>>>>>>>>>> <t...@apache.org> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      The Python SDK currently uses timestamps in 
> >>>>>>>>>>>>>>>>> microsecond resolution
> >>>>>>>>>>>>>>>>>      while the Java SDK, as most would probably expect, uses 
> >>>>>>>>>>>>>>>>> milliseconds.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      This causes a few difficulties with portability 
> >>>>>>>>>>>>>>>>> (Python coders need
> >>>>>>>>>>>>>>>>>      to convert to millis for WindowedValue and Timers), 
> >>>>>>>>>>>>>>>>>      which is related 
> >>>>>>>>>>>>>>>>>      to a bug I'm looking into:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      https://issues.apache.org/jira/browse/BEAM-7035
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      As Luke pointed out, the issue was previously 
> >>>>>>>>>>>>>>>>> discussed:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      https://issues.apache.org/jira/browse/BEAM-1524
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      I'm not privy to the reasons why we decided to go with 
> >>>>>>>>>>>>>>>>> micros in
> >>>>>>>>>>>>>>>>>      the first place, but would it be too big of a change or 
> >>>>>>>>>>>>>>>>> impractical for
> >>>>>>>>>>>>>>>>>      other reasons to switch Python SDK to millis before it 
> >>>>>>>>>>>>>>>>> gets more users?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>      Thanks,
> >>>>>>>>>>>>>>>>>      Thomas
> >>>>>>>>>>>>>>>>>
