On Tue, Mar 6, 2018 at 1:06 PM Shen Li <cs.she...@gmail.com> wrote:
> Should ParDo advance output watermarks based on only main input or all
> inputs? Say if the watermark from a side input falls behind, should it
> block the output watermark of the ParDo?
The rule is that if the user's DoFn might output data with a timestamp,
that timestamp should be a bound on the output watermark. For side inputs,
I don't think this is the case. The readiness of the side input plus the
info in the WindowMappingFn will determine which main elements must be
pushed back, and this will bound the output watermark.
The exception to the rule is that if data is already behind the watermark
(it is "already late"), it is OK to let the watermark advance, because that
doesn't make the data "more late". Instead, apply all the same holding rules
to GC time so the data doesn't become droppable. The reason for this is that a
large influx of late data could cause a backlog that prevents more recent
data from achieving good latency.
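A minimal sketch of the rule and its exception, for illustration only. The
class and method names are hypothetical and are not Beam APIs. It assumes
integer timestamps and that "late" means the element's timestamp is behind
the input watermark at the moment the element is buffered. On-time buffered
elements hold the output watermark; already-late ones do not, and are only
tracked so that a runner could apply the equivalent hold to GC time.

```python
class WatermarkTracker:
    """Toy model of output-watermark holds (hypothetical, not a Beam class)."""

    def __init__(self):
        self.input_watermark = float("-inf")
        self.holds = []         # timestamps of on-time buffered elements
        self.late_pending = []  # already-late elements: no output hold

    def advance_input(self, watermark):
        # Watermarks only move forward.
        self.input_watermark = max(self.input_watermark, watermark)

    def buffer(self, timestamp):
        if timestamp < self.input_watermark:
            # Already late: letting the watermark advance does not make it
            # "more late", so it must not block the output watermark.
            self.late_pending.append(timestamp)
        else:
            # The DoFn may still output at this timestamp, so it bounds
            # the output watermark.
            self.holds.append(timestamp)

    def output_watermark(self):
        return min([self.input_watermark] + self.holds)


tracker = WatermarkTracker()
tracker.advance_input(100)
tracker.buffer(120)   # on time: becomes a hold
tracker.buffer(90)    # already late: no hold
tracker.advance_input(200)
print(tracker.output_watermark())
```

Here the late element at 90 does not pull the output watermark back, while
the on-time element at 120 keeps it from following the input watermark to 200.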
> If there are pushed back elements, should the ParDo hold back its output
> watermarks until corresponding pushed back elements are all processed?
Yes, it should hold the watermark for these.
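
To illustrate the push-back behavior, here is a rough sketch under stated
assumptions. None of these names are Beam APIs: `PushbackParDo` is a
hypothetical toy, and `side_window_for` stands in for a WindowMappingFn by
mapping a main-input timestamp to a fixed-size side-input window. A main
element whose side-input window is not ready is pushed back and holds the
output watermark at its timestamp until it is processed. The side input's own
watermark is never consulted directly.

```python
class PushbackParDo:
    """Toy model of side-input push-back (hypothetical, not a Beam class)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.ready_windows = set()
        self.pushed_back = []   # timestamps of blocked main elements
        self.outputs = []
        self.input_watermark = float("-inf")

    def side_window_for(self, timestamp):
        # Stand-in for WindowMappingFn: fixed windows of `window_size`.
        return timestamp // self.window_size

    def process(self, timestamp):
        if self.side_window_for(timestamp) in self.ready_windows:
            self.outputs.append(timestamp)
        else:
            self.pushed_back.append(timestamp)

    def side_input_ready(self, window):
        # Mark the window ready and replay any elements it was blocking.
        self.ready_windows.add(window)
        still_blocked = []
        for ts in self.pushed_back:
            if self.side_window_for(ts) == window:
                self.outputs.append(ts)
            else:
                still_blocked.append(ts)
        self.pushed_back = still_blocked

    def output_watermark(self):
        # Pushed-back elements hold the output watermark.
        return min([self.input_watermark] + self.pushed_back)


pardo = PushbackParDo(window_size=10)
pardo.side_input_ready(0)
pardo.process(5)     # side window 0 is ready: processed immediately
pardo.process(12)    # side window 1 is not ready: pushed back
pardo.input_watermark = 30
print(pardo.output_watermark())
pardo.side_input_ready(1)
print(pardo.output_watermark())
```

The sketch leaves out the already-late exception described earlier and only
shows the hold being released once the pushed-back element is processed.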