Hi Community,

BeamSQL currently does not support unbounded-unbounded join with
non-default trigger. It is because:

- Discarding mode does not work for outer joins because of lacking of
ability to retract pre-emitted values. You can think about an example in
which a tuple of (left_row, null) needed to be retracted  if the matched
right_row appears since last trigger fired.
- Accumulating mode *theoretically* can support unbounded-unbounded join
because it's supposed to always "overwrite" previous result. However in
practice, for join use cases such overwriting is too expensive. It would be
much more efficient if small changes in inputs of join only cause small
changes to downstream to compute.
- Both discarding mode and accumulating mode are not sufficient to refine
materialized data.

Meanwhile, [1] has kicked off a discussion on retractions in Beam model. I
have been collecting people's feedback and generally speaking people agree
that retractions are useful for some use cases.

Thus I propose to combine SQL join with retractions to
support multiple-triggering SQL Join.

I think SQL join is a good start for supporting retraction in Beam with the
following caveats:
1. multiple-triggering SQL Join is a useful feature.
2. SQL join is an opportunity for us to figure out implementation details
of retraction by building it for a well defined use case.
3. Supporting retraction should not cause performance regression on
existing pipelines, or require changes on existing pipelines.


What do you think?

[1]:
https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E


-Rui

Reply via email to