Hi Mohil,

Please take a look at the "slowly updating global window side inputs" pattern:
https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs

Also, I have a design doc out that handles a similar case. I'm currently
prototyping it in Python:
https://lists.apache.org/thread.html/r792fcf4b6adbce79ea1eb81592d29a3cee7aef768ba4615ac2d078ad%40%3Cdev.beam.apache.org%3E

Regards,
--Mikhail

On Mon, Jan 27, 2020 at 8:56 AM Mohil Khare <[email protected]> wrote:

> Hi,
> This is Mohil Khare from San Jose, California. I work at an early-stage
> startup, Prosimo.
> We use Apache Beam with GCP Dataflow for all real-time stats processing,
> with Kafka and Pub/Sub as data sources and Elasticsearch and GCS as sinks.
>
> I am trying to solve the following use case with side inputs.
>
> INPUT:
> 1. We have a continuous stream of data coming from Pub/Sub topicA. This
> data can be put in a KV PCollection, and each data item can be uniquely
> identified by a certain key.
> 2. We have a very slowly changing stream of data coming from Pub/Sub
> topicB, i.e. data arrives on topicB for a few minutes, followed by no
> activity for a long time. This stream of data can again be put in a KV
> PCollection with the same keys as above. NOTE: after a long period of
> inactivity, it is possible that data arrives for only certain keys.
>
> DESIRED OUTPUT/PROCESSING:
> 1. I want to use the topicB KV PCollection as a side input to enrich data
> arriving on topicA. I think View.asMap could be a good choice for it.
> 2. After enriching data from topicA using side input data from topicB,
> write to GCS in fixed windows of 10 minutes.
> 3. Continue using the above PCollectionView as a side input as long as
> no new data arrives on topicB.
> 4. Whenever new data arrives on topicB, update the PCollectionView map
> only for the set of keys that arrived in the new stream.
>
> My question is: what would be the best approach to tackle this use case? I
> would really appreciate it if someone could suggest a good solution.
>
> Thanks and regards,
> Mohil Khare
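
For reference, the per-key update semantics requested in the quoted message (keep the existing map until topicB produces new data, then replace only the keys that arrived) can be modelled in plain Python. This is only a sketch of the intended behaviour, not Beam code: it does not use the Beam API and ignores windowing and triggering. The names `merge_side_input` and `enrich` and the sample keys and values are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def merge_side_input(
    current: Dict[str, str], updates: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Return a new map in which only the keys present in `updates` change."""
    merged = dict(current)  # copy, so the previous snapshot stays intact
    for key, value in updates:
        merged[key] = value
    return merged


def enrich(
    events: Iterable[Tuple[str, int]], side_input: Dict[str, str]
) -> List[Tuple[str, int, Optional[str]]]:
    """Attach the side-input value for each event key (None if the key is unknown)."""
    return [(key, payload, side_input.get(key)) for key, payload in events]


# Hypothetical timeline: topicB delivers a full set of keys, then goes quiet.
side_input = merge_side_input({}, [("k1", "a"), ("k2", "b")])
first_batch = enrich([("k1", 10), ("k2", 20)], side_input)

# Later, topicB delivers data for k2 only; k1 keeps its old value.
side_input = merge_side_input(side_input, [("k2", "B2")])
second_batch = enrich([("k1", 11), ("k2", 21), ("k3", 31)], side_input)
```

In this model, `first_batch` sees the initial values for both keys, while `second_batch` sees the unchanged value for `k1`, the refreshed value for `k2`, and `None` for `k3`, which never appeared on topicB.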
