Hya, I never got a chance to finish this one, maybe I will get some time in the summer break... but I think it will help with your use case...
https://github.com/rezarokni/beam/blob/BEAM-7386/sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java Cheers Reza On Fri, Jul 10, 2020 at 8:58 AM Harrison Green <hgarrer...@google.com> wrote: > Hi Beam devs, > > I'm working on a streaming pipeline where we need to do a "most-recent" > join between two PCollections. Specifically, something like: > > out = pcoll1 | beam.Map(lambda a,b: (a,b), > b=beam.pvalue.AsSingleton(pcoll2)) > > The goal is to join each value in pcoll1 with only the most recent value > from pcoll2. (in this case pcoll2 is much more sparse than pcoll1) > --- > altay@ suggested using a global window for the side-input pcollection > with a trigger on each element. I've been trying to simulate this behavior > locally with beam.testing.TestStream but I've been running into some issues. > > Specifically, the Repeatedly(AfterCount(1)) trigger seems to work > correctly, but the side input receives too many panes (even when using > discarding accumulation). I've set up a minimal demo here: > https://colab.research.google.com/drive/1K0EqcKWxa4UK3SrkLBeHs7HSynw_VfSZ?usp=sharing > In this example, I'm trying to join values from pcollection "a" with > pcollection "b". However each pane of pcollection "a" is able to "see" all > of the panes from pcollection "b" which is not what I would expect. > > I am curious if anyone has advice for how to handle this type of problem > or an alternative solution for the "most-recent" join. (side note: I was > able to hack together an alternative solution that uses a custom > window/windowing strategy but it was fairly complex and I think a strategy > that uses GlobalWindows would be preferred). > > Sincerely, > Harrison >