Hi,
I'd like to clarify my understanding. Side inputs generally perform a
left (outer) join: the LHS is the main input, the RHS is the side input.
A streaming left join requires watermark synchronization, so elements
from the main input are buffered while main_input_timestamp >
side_input_watermark. Once the side input watermark reaches the max
watermark, main input elements no longer have to be buffered, because
the side input will not change anymore. This works well when the side
input is bounded, or for "slowly changing" patterns (ideally with
perfect watermarks, so that no late data is present).
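To make the buffering rule concrete, here is a minimal plain-Python
sketch of how I understand it. This is not Beam API: the class and
method names are made up for illustration, and the toy keeps only the
latest side input value instead of looking up the value valid at the
element's timestamp.

```python
import math

# Stand-in for the "max watermark": the side input is final.
MAX_WATERMARK = math.inf


class SideInputJoin:
    """Toy model of the buffering rule (hypothetical, not a Beam class)."""

    def __init__(self):
        self.side_watermark = -math.inf
        self.side_value = None  # None models "no match" in the left join
        self.buffer = []        # main elements waiting for the side watermark
        self.output = []        # joined (element, side_value) pairs

    def _ready(self, timestamp):
        # An element is released once the side watermark has reached
        # its timestamp, i.e. it is buffered while ts > side watermark.
        return timestamp <= self.side_watermark

    def on_main(self, element, timestamp):
        if self._ready(timestamp):
            self.output.append((element, self.side_value))
        else:
            self.buffer.append((element, timestamp))

    def on_side(self, value, watermark):
        # Simplification: a new side value and a watermark advance
        # arrive together.
        self.side_value = value
        self.side_watermark = watermark
        still_waiting = []
        for element, timestamp in self.buffer:
            if self._ready(timestamp):
                self.output.append((element, self.side_value))
            else:
                still_waiting.append((element, timestamp))
        self.buffer = still_waiting
```

Once on_side(..., MAX_WATERMARK) has been called, _ready() is true for
every finite timestamp, so nothing is buffered anymore, which matches
the bounded side input case.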
Allowing arbitrary changes in the side input (with arbitrary triggers)
raises additional questions, e.g. how to handle late data in the side
input? A full implementation would require retractions. Dropping late
data does not feel like a solution, because then the pipeline would not
converge to the "correct" result, as the side input might hold an
incorrect value forever. Applying late data as of the processing time
at which the DoFn receives it could make the downstream processing
unstable: restarting the pipeline on errors might change what is "on
time" and what is late, and thus generate inconsistent results.
It seems safe to process multiple triggers as long as the trigger does
not produce late data, though (i.e. early emitting). Processing possibly
late data might require buffering the main input while
main_input_timestamp > side_input_watermark - allowed_lateness.
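Written as a predicate, the condition I have in mind looks roughly like
this. It is only a sketch: must_buffer is a hypothetical helper, not
something that exists in Beam, and timestamps are plain numbers here.

```python
def must_buffer(main_input_timestamp, side_input_watermark, allowed_lateness=0):
    """Return True while the side input may still change for this timestamp.

    With allowed_lateness == 0 this reduces to the plain watermark rule;
    a positive allowed_lateness holds main input elements back for longer.
    """
    return main_input_timestamp > side_input_watermark - allowed_lateness


# Element at t=100, side input watermark at 105:
print(must_buffer(100, 105))                       # False: can be released
print(must_buffer(100, 105, allowed_lateness=10))  # True: late data may still arrive
print(must_buffer(100, 110, allowed_lateness=10))  # False: lateness horizon passed
```

So the price of accepting late data in the side input is extra latency
(and buffered state) on the main input, proportional to
allowed_lateness.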
Is my line of thinking correct?
Jan
On 3/23/23 20:19, Kenneth Knowles wrote:
Hi all,
I had a great chat with +Reza Rokni <mailto:rezaro...@google.com> and
+Reuven Lax <mailto:re...@google.com> yesterday about some
inconsistencies in side input behavior, both before and after
portability was introduced.
I wrote up my thoughts about how we should specify the semantics and
implement them:
https://s.apache.org/beam-triggered-side-inputs
I think I found some issues that may require changes in the
portability protocols to get consistent behavior.
Please take a look and find my errors and oversights!
Kenn