Hi,

this is follow-up thread started from [1]. In the thread there is mentioned multiple times that (in stateless ParDo), the output watermark is allowed to advance only on bundle boundaries [2]. Essentially that would mean that anything in between calls to @StartBundle and @FinishBundle would be processed in single instant in (output) event-time. This makes perfect sense.

The issue is that it seems that not all runners actually implement this behavior. FlinkRunner for instance does not have a "natural" concept of bundles and those are created in a more ad-hoc way to adhere with the DoFn life-cycle (see [3]). Watermark updates and elements are completely interleaved without any synchronization with bundle "open" or "close". If watermark updates are allowed to happen only on boundaries of bundles, then this seems to break this contract.

The question therefore is - should we consider FlinkRunner as non-compliant with this aspect of the Apache Beam model or is this an "optional" part that runners are free to implement at will? In the case of the former, do we miss some @ValidatesRunner tests for this?

 Jan

[1] https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db

[2] https://lists.apache.org/thread/7foy455spg43xo77zhrs62gc1m383t50

[3] https://github.com/apache/beam/blob/14862ccbdf2879574b6ce49149bdd7c9bf197322/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L786

Reply via email to