Actually, Reuven, that's no longer the case. It used to be that incoming data was compared to the watermark but it is not today. Instead, Jeff's first phrasing is perfect.
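
To make the difference concrete, here is a toy sketch in Python. It is not Beam's code: the `Window` class, both function names, and the integer timestamps are made up for illustration, and the exact form of the older per-element check is my assumption. Boundary conditions are also simplified. The point is only that the newer rule asks whether the element's window has expired (watermark past the end of the window plus allowed lateness), not whether the element's own timestamp is behind the watermark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical event-time window [start, end), in whole seconds."""
    start: int
    end: int


def dropped_by_element_check(timestamp, watermark, allowed_lateness=0):
    # Assumed shape of the older rule: compare the element's own
    # timestamp directly against the watermark.
    return timestamp + allowed_lateness < watermark


def dropped_by_window_check(window, watermark, allowed_lateness=0):
    # Newer rule as described in this thread: drop only if the window
    # itself has expired, i.e. the watermark has passed the end of the
    # window plus the allowed lateness. (Simplified boundary handling.)
    return watermark >= window.end + allowed_lateness


if __name__ == "__main__":
    window = Window(start=0, end=60)
    watermark = 30          # window [0, 60) is still being aggregated
    element_timestamp = 10  # behind the watermark, but inside the window

    print("element check drops it:", dropped_by_element_check(element_timestamp, watermark))
    print("window check drops it: ", dropped_by_window_check(window, watermark))
```

With the watermark at 30, an element stamped 10 is behind the watermark, so the per-element check would drop it even though the window [0, 60) is still open. The window check keeps it.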

One way to see it is to think about the consequences of late data: if there is a grouping/aggregation by key+window, the window determines when the grouping is complete. We go ahead and include any data that shows up before the window is complete. And if you set up allowed lateness, it matches exactly: any data that arrives before the ON_TIME output gets to be in that output.

Previously, when we compared incoming elements to the watermark directly, you could have a window that was still being aggregated while elements that fell in that window were dropped. There was no technical benefit to losing this data, so we stopped dropping it. We also had lots of tricky bugs and hard-to-manage code related to what to do when an element arrives after the watermark. And you could have an ON_TIME firing that included a bunch of "late" data, which is confusing. Now it is simple: if the window is still alive, the element goes into it.

I very rarely use the term "late data" when describing Beam's semantics anyhow. I always found the term / definition a bit arbitrary.

Kenn

On Thu, Jan 17, 2019 at 8:13 PM Rui Wang <ruw...@google.com> wrote:

> I created this PR: https://github.com/apache/beam/pull/7556
>
> Feel free to review/comment it.
>
> -Rui
>
> On Thu, Jan 17, 2019 at 2:37 PM Rui Wang <ruw...@google.com> wrote:
>
>> It might be better to keep something like "watermark usually consistently
>> moves forward". But "Elements that arrive with a smaller timestamp than the
>> current watermark are considered late data." has already given the order of
>> late data ts and watermark.
>>
>> -Rui
>>
>> On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>>> Reuven - I don't think I realized it was possible to have late data with
>>> the global window, so I'm definitely learning things through this
>>> discussion.
>>>
>>> New suggested wording, then:
>>>
>>> Elements that arrive with a smaller timestamp than the current
>>> watermark are considered late data.
>>>
>>> That says basically the same thing as the wording currently in the
>>> guide, but uses "smaller" (which implies a less-than-watermark comparison)
>>> rather than "later" (which folks have interpreted as a
>>> greater-than-watermark comparison).
>>>
>>> On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Though it's not tied to window. You could be in the global window, so
>>>> the watermark never advances past the end of the window, yet still get late
>>>> data.
>>>>
>>>> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <jklu...@mozilla.com wrote:
>>>>
>>>>> How about: "Once the watermark progresses past the end of a window,
>>>>> any further elements that arrive with a timestamp in that window are
>>>>> considered late data."
>>>>>
>>>>> On Thu, Jan 17, 2019 at 1:43 PM Rui Wang <ruw...@google.com> wrote:
>>>>>
>>>>>> Hi Community,
>>>>>>
>>>>>> In Beam programming guide [1], there is a sentence: "Data that
>>>>>> arrives with a timestamp after the watermark is considered *late
>>>>>> data*"
>>>>>>
>>>>>> Seems like people get confused by it. For example, see Stackoverflow
>>>>>> comment [2]. Basically it makes people think that an event timestamp that
>>>>>> is bigger than watermark is considered late (due to that "after").
>>>>>>
>>>>>> Although there is an example right after this sentence to explain late
>>>>>> data, seems to me that this sentence is incomplete. The complete sentence
>>>>>> to me can be: "The watermark consistently advances from -inf to +inf.
>>>>>> Data that arrives with a timestamp after the watermark is considered late
>>>>>> data."
>>>>>>
>>>>>> Am I understanding this correctly? Is there a better description for the
>>>>>> order of late data and watermark? I would be happy to send a PR to update
>>>>>> the Beam documentation.
>>>>>>
>>>>>> -Rui
>>>>>>
>>>>>> [1]: https://beam.apache.org/documentation/programming-guide/#windowing
>>>>>> [2]: https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971