Hi,
this is a follow-up to the thread started in [1]. That thread mentions
multiple times that (in a stateless ParDo) the output watermark is
allowed to advance only on bundle boundaries [2]. Essentially, that
would mean that everything between the calls to @StartBundle and
@FinishBundle is processed at a single instant in (output) event time.
This makes perfect sense.
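
To make the expected behavior concrete, here is a minimal toy model in
plain Java. It is not Beam or runner code; the class and method names
(BundleWatermarkHold, onInputWatermark, ...) are hypothetical and only
illustrate the idea of holding the output watermark back while a bundle
is open and releasing it when the bundle finishes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Beam API): output watermark moves only on bundle boundaries. */
class BundleWatermarkHold {
    private boolean bundleOpen = false;
    private long pendingWatermark = Long.MIN_VALUE;
    private long outputWatermark = Long.MIN_VALUE;
    private final List<Long> emitted = new ArrayList<>();

    /** Corresponds conceptually to the point where @StartBundle is invoked. */
    void startBundle() {
        bundleOpen = true;
    }

    /** Records an incoming watermark; forwards it only if no bundle is open. */
    void onInputWatermark(long watermark) {
        pendingWatermark = Math.max(pendingWatermark, watermark);
        if (!bundleOpen) {
            advance();
        }
    }

    /** Corresponds conceptually to @FinishBundle; releases any held watermark. */
    void finishBundle() {
        bundleOpen = false;
        advance();
    }

    private void advance() {
        if (pendingWatermark > outputWatermark) {
            outputWatermark = pendingWatermark;
            emitted.add(outputWatermark);
        }
    }

    long outputWatermark() {
        return outputWatermark;
    }

    List<Long> emitted() {
        return emitted;
    }
}
```

In this sketch, watermarks that arrive between startBundle() and
finishBundle() are only remembered; the output watermark jumps to the
largest of them once the bundle closes.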
The issue is that not all runners seem to actually implement this
behavior. FlinkRunner, for instance, does not have a "natural" concept
of bundles; they are created in a more ad-hoc way to adhere to the
DoFn life-cycle (see [3]). Watermark updates and elements are completely
interleaved, without any synchronization with bundle "open" or "close".
If watermark updates are allowed to happen only on bundle boundaries,
then this seems to break that contract.
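
As an illustration of what "breaking the contract" would look like,
here is a small hypothetical check over a recorded trace of operator
events. The event names and the BundleTraceCheck class are made up for
this sketch and do not correspond to any existing Beam or Flink API;
WATERMARK stands for an output watermark emission.

```java
import java.util.List;

/** Hypothetical trace checker; event names are invented for illustration. */
class BundleTraceCheck {
    enum Event { START_BUNDLE, ELEMENT, WATERMARK, FINISH_BUNDLE }

    /** Returns true if no WATERMARK occurs between START_BUNDLE and FINISH_BUNDLE. */
    static boolean watermarkOnlyOnBundleBoundaries(List<Event> trace) {
        boolean bundleOpen = false;
        for (Event e : trace) {
            switch (e) {
                case START_BUNDLE:
                    bundleOpen = true;
                    break;
                case FINISH_BUNDLE:
                    bundleOpen = false;
                    break;
                case WATERMARK:
                    if (bundleOpen) {
                        return false; // watermark emitted mid-bundle
                    }
                    break;
                default:
                    break; // ELEMENT events do not affect the check
            }
        }
        return true;
    }
}
```

A trace such as START_BUNDLE, ELEMENT, WATERMARK, ELEMENT, FINISH_BUNDLE
fails this check, while START_BUNDLE, ELEMENT, FINISH_BUNDLE, WATERMARK
passes it.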
The question therefore is: should we consider FlinkRunner
non-compliant with this aspect of the Apache Beam model, or is this an
"optional" part that runners are free to implement at will? If the
former, are we missing some @ValidatesRunner tests for this?
Jan
[1] https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db
[2] https://lists.apache.org/thread/7foy455spg43xo77zhrs62gc1m383t50
[3] https://github.com/apache/beam/blob/14862ccbdf2879574b6ce49149bdd7c9bf197322/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L786