Hi,
what seems most "surprising" to me is that we are using
TimestampCombiners to do two (orthogonal) things:
a) calculate a watermark hold for a window, so on-time elements
emitted from a pane are not late in downstream processing
b) calculate timestamp of elements in output pane
These two follow slightly different constraints: while in case a) it is
not allowed to shift the watermark "back in time", in case b) it seems OK
to output data with a timestamp lower than the output watermark (what
comes late may leave late). So, while it seems OK to discard late
elements for the sake of calculating the output watermark, it seems
wrong to discard them when calculating the output timestamp. Maybe these
two timestamps could be held in different states (the state would be held
until GC time for accumulating panes and reset for discarding panes)?
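To make the two roles concrete, here is a minimal standalone sketch (plain Java, no Beam dependency; all class and method names are mine, not Beam's API) of how the two computations could differ when late elements are present:

```java
import java.util.List;

// Hypothetical sketch: the two orthogonal uses of a timestamp combiner,
// modeled with plain long millis instead of Beam's Instant.
class PaneTimestamps {

  // a) Watermark hold: the hold may never move the output watermark
  //    "back in time", so elements already behind the current output
  //    watermark cannot contribute a hold.
  static long watermarkHold(List<Long> elementTimestamps, long currentOutputWatermark) {
    return elementTimestamps.stream()
        .filter(ts -> ts >= currentOutputWatermark) // late elements are discarded here
        .min(Long::compare)
        .orElse(currentOutputWatermark);
  }

  // b) Output timestamp (LATEST-style combining): late elements may still
  //    contribute -- "what comes late may leave late".
  static long outputTimestamp(List<Long> elementTimestamps) {
    return elementTimestamps.stream().max(Long::compare).orElseThrow();
  }
}
```

With elements at 5, 15, 25 and the output watermark at 10, the hold lands at 15 (the late element at 5 is ignored), while the output timestamp is 25 regardless of lateness.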
Jan
On 5/28/20 5:02 PM, David Morávek wrote:
Hi,
I've come across "unexpected" model behaviour when dealing with late
data and custom timestamp combiners. Let's take the following pipeline
as an example:
final PCollection<String> input = ...;
input.apply(
        "GlobalWindows",
        Window.<String>into(new GlobalWindows())
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(10))))
            .withTimestampCombiner(TimestampCombiner.LATEST)
            .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_IF_NON_EMPTY)
            .accumulatingFiredPanes())
    .apply("Aggregate", Count.perElement());
The above pipeline emits updates with the latest input timestamp it
has seen so far (from non-late elements). We write the output with
this timestamp to Kafka and read it from another pipeline.
The problem comes when we need to handle late elements behind the
output watermark. In this case Beam cannot use the combined timestamp
and uses the EOW timestamp instead. Unfortunately, this results in the
downstream pipeline progressing its input watermark to the end of the
global window. Also, if we used fixed windows after this aggregation,
it would yield unexpected results.
There is no reasoning about this behaviour in the last section of the
lateness design doc [1], so I'd like to open a discussion about what
the expected result should be.
My personal opinion is that the correct approach would be to emit late
elements with the currentOutputWatermark rather than EOW in the case
of the EARLIEST and LATEST timestamp combiners.
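A minimal sketch of the two alternatives (plain Java, hypothetical names; the real logic lives inside Beam, e.g. in ReduceFnRunner):

```java
// Hypothetical sketch of the behaviour under discussion: how the output
// timestamp of a pane could be chosen when the pane contains only late
// elements.
class LatePaneTimestamp {

  // Current behaviour (as reported): the combined timestamp cannot be used
  // for a late-only pane, so Beam falls back to the end-of-window timestamp.
  static long currentBehaviour(boolean paneIsLateOnly, long combined, long endOfWindow) {
    return paneIsLateOnly ? endOfWindow : combined;
  }

  // Proposed behaviour for EARLIEST/LATEST combiners: emit late-only panes
  // at the current output watermark instead, so the downstream watermark
  // never jumps to the end of the (global) window.
  static long proposedBehaviour(boolean paneIsLateOnly, long combined, long currentOutputWatermark) {
    return paneIsLateOnly ? currentOutputWatermark : combined;
  }
}
```

Under the proposal, a late-only pane in a GlobalWindows pipeline would surface at the current output watermark rather than at the (effectively infinite) end of the global window.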
I've prepared a failing test case for ReduceFnRunner [3], if anyone
wants to play around with the issue.
I also think that BEAM-2262 [2] may be related to this discussion.
[1] https://s.apache.org/beam-lateness
[2] https://issues.apache.org/jira/browse/BEAM-2262
[3]
https://github.com/dmvk/beam/commit/c93cd26681aa6fbc83c15d2a7a8146287f1e850b
Looking forward to hearing your thoughts.
Thanks,
D.