Hiya,

I never got a chance to finish this one; maybe I will get some time over the
summer break. I think it will help with your use case, though:

https://github.com/rezarokni/beam/blob/BEAM-7386/sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java

Cheers
Reza

On Fri, Jul 10, 2020 at 8:58 AM Harrison Green <hgarrer...@google.com>
wrote:

> Hi Beam devs,
>
> I'm working on a streaming pipeline where we need to do a "most-recent"
> join between two PCollections. Specifically, something like:
>
> out = pcoll1 | beam.Map(lambda a, b: (a, b),
>                         b=beam.pvalue.AsSingleton(pcoll2))
>
> The goal is to join each value in pcoll1 with only the most recent value
> from pcoll2. (In this case, pcoll2 is much sparser than pcoll1.)
> ---
> altay@ suggested using a global window for the side-input pcollection
> with a trigger on each element. I've been trying to simulate this behavior
> locally with beam.testing.TestStream but I've been running into some issues.
>
> Specifically, the Repeatedly(AfterCount(1)) trigger seems to work
> correctly, but the side input receives too many panes (even when using
> discarding accumulation). I've set up a minimal demo here:
> https://colab.research.google.com/drive/1K0EqcKWxa4UK3SrkLBeHs7HSynw_VfSZ?usp=sharing
> In this example, I'm trying to join values from pcollection "a" with
> pcollection "b". However, each pane of pcollection "a" is able to "see" all
> of the panes from pcollection "b", which is not what I would expect.
>
> I am curious if anyone has advice for how to handle this type of problem
> or an alternative solution for the "most-recent" join. (side note: I was
> able to hack together an alternative solution that uses a custom
> window/windowing strategy but it was fairly complex and I think a strategy
> that uses GlobalWindows would be preferred).
>
> Sincerely,
> Harrison
>
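
For reference, here is a minimal sketch of the "most-recent" join semantics
described in the quoted message, written in plain Python rather than Beam. It
is not the BiTemporalStreams code linked above and not a pipeline
implementation; the function name most_recent_join and the (timestamp, value)
tuple layout are assumptions made only for illustration. It may be useful as
a reference for what a TestStream-based test would be expected to produce.

```python
import bisect
from typing import Iterable, List, Optional, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def most_recent_join(
    main: Iterable[Tuple[float, A]],
    side: Iterable[Tuple[float, B]],
) -> List[Tuple[A, Optional[B]]]:
    """Pair each main value with the latest side value at or before its timestamp.

    Hypothetical helper: both inputs are (timestamp, value) tuples.
    """
    side_sorted = sorted(side, key=lambda ts_value: ts_value[0])
    side_times = [ts for ts, _ in side_sorted]

    joined = []
    for ts, value in sorted(main, key=lambda ts_value: ts_value[0]):
        # Number of side elements with timestamp <= ts.
        idx = bisect.bisect_right(side_times, ts)
        # No side value has arrived yet if idx == 0.
        latest = side_sorted[idx - 1][1] if idx > 0 else None
        joined.append((value, latest))
    return joined


# The dense stream plays the role of pcoll1, the sparse one pcoll2.
pcoll1 = [(1, "a1"), (2, "a2"), (5, "a3"), (6, "a4")]
pcoll2 = [(2, "b1"), (5.5, "b2")]
result = most_recent_join(pcoll1, pcoll2)
```

With these inputs, "a1" has no partner yet, "a2" and "a3" both pair with
"b1", and "a4" pairs with "b2". Each main element sees exactly one side value,
never the full history of side panes.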
