Hi Mohil,

Please, take a look at.
https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs


Also, I have design doc out that handles similar case. I'm working on
prototyping it in python atm.
https://lists.apache.org/thread.html/r792fcf4b6adbce79ea1eb81592d29a3cee7aef768ba4615ac2d078ad%40%3Cdev.beam.apache.org%3E


Regards,
--Mikhail

On Mon, Jan 27, 2020 at 8:56 AM Mohil Khare <[email protected]> wrote:

> Hi,
> This is Mohil Khare from San Jose, California. I work in an early stage
> startup: Prosimo.
> We use Apache beam with gcp dataflow for all real time stats processing
> with Kafka and Pubsub as data source while elasticsearch and GCS as sinks.
>
> I am trying to solve the following use with sideinputs.
>
> INPUT:
> 1. We have a continuous stream of data coming from pubsub topicA. This
> data can be put in KV Pcollection and each data item can be uniquely
> identified with certain key.
> 2. We have a very slow changing stream of data coming from pubsub topicB
> i.e. you can say that stream of data comes for few mins on topicB followed
> by no activity for a long time period.   This stream of data can be again
> put in KV PCollection with same keys as above. NOTE: after long inactivity,
> it is possible that data comes for only certain keys.
>
> DESIRED OUTPUT/PROCESSING:
> 1. I want to use KV PCollection as sideinput to enrich data arriving in
> topicA. I think View.asMap can be a good choice for it.
> 2. After enriching data in topic A using sideinput data from topic B,
> write to GCS in a fixed window of 10 minutes
> 2.  Want to continue using above PCollectionView as sideinput as long as
> no new data arrives in topicB.
> 3. Whenever new data arrives in topicB, want to update PCollectionView Map
> only for set of Keys that arrived in new stream.
>
> My question is what should be the best approach to tackle this use case? I
> will really appreciate if someone can suggest some good solution.
>
> Thanks and Regards
> Mohil Khare
>
>
>
>
>

Reply via email to