Hi,

I'd like to clarify my understanding. Side inputs generally perform a left (outer) join, LHS side is the main input, RHS is the side input. Doing streaming left join requires watermark synchronization, thus elements from the main input are buffered until main_input_timestamp > side_input_watermark. When side input watermark reaches max watermark, main inputs do not have to be buffered because the side input will not change anymore. This works well for cases when the side input is bounded or in the case of "slowly changing" patterns (ideally with perfect watermarks, so no late data present).

Allowing arbitrary changes in the side input (with arbitrary triggers) might introduce additional questions - how to handle late data in the side input? Full implementation would require retractions. Dropping late data does not feel like a solution, because then the pipeline would not converge to the "correct" solution, as the side input might hold incorrect value forever. Applying late data from the processing time the DoFn receives them could make the downstream processing unstable, restarting the pipeline on errors might change what is "on time" and what is late thus generate inconsistent different results.

It seems safe to process multiple triggers as long as the trigger does not produce late data, though (i.e. early emitting). Processing possibly late data might requires to buffer main input up while main_input_timestamp > side_input_watermark - allowed_lateness.

Is my line of thinking correct?

 Jan

On 3/23/23 20:19, Kenneth Knowles wrote:
Hi all,

I had a great chat with +Reza Rokni <mailto:rezaro...@google.com> and +Reuven Lax <mailto:re...@google.com> yesterday about some inconsistencies in side input behavior, both before and after portability was introduced.

I wrote up my thoughts about how we should specify the semantics and implement them:

https://s.apache.org/beam-triggered-side-inputs

I think I found some issues that I think may require changes in the portability protocols to get consistent behavior.

Please take a look and find my errors and oversights!

Kenn

Reply via email to