Hi Mohil, I've responded to another thread you started. Please check it here: https://lists.apache.org/thread.html/r93066c4a3daefba954b23dbdb4536159ff3ff29b06052b92e747da17%40%3Cdev.beam.apache.org%3E
Regards, Mikhail.

On Tue, Jan 28, 2020 at 10:22 AM Mohil Khare <[email protected]> wrote:

> Hi,
> This is Mohil Khare from San Jose, California. I work at an early-stage
> startup, Prosimo.
> We use Apache Beam with GCP Dataflow for all real-time stats processing,
> with Kafka and Pub/Sub as data sources and Elasticsearch and GCS as sinks.
>
> I am trying to solve the following use case with side inputs.
>
> INPUT:
> 1. We have a continuous stream of data coming from Pub/Sub topicA. This
> data can be put in a KV PCollection, and each data item can be uniquely
> identified by a certain key.
> 2. We have a very slowly changing stream of data coming from Pub/Sub
> topicB, i.e. data arrives on topicB for a few minutes, followed by no
> activity for a long period. This stream can again be put in a KV
> PCollection with the same keys as above. NOTE: after a long period of
> inactivity, it is possible that data arrives for only certain keys.
>
> DESIRED OUTPUT/PROCESSING:
> 1. I want to use the KV PCollection from topicB as a side input to
> enrich the data arriving on topicA. I think View.asMap could be a good
> choice for this.
> 2. After enriching the data from topicA using the side-input data from
> topicB, write it to GCS in a fixed window of 10 minutes.
> 3. I want to continue using the above PCollectionView as a side input
> as long as no new data arrives on topicB.
> 4. Whenever new data arrives on topicB, I want to update the
> PCollectionView map only for the set of keys that arrived in the new
> stream.
>
> My question is: what would be the best approach to tackle this use
> case? I would really appreciate it if someone could suggest a good
> solution.
>
> Thanks and Regards
> Mohil Khare
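[Editor's note: step 4 of Mohil's desired processing (refreshing the side-input map only for the keys that arrived in the new batch) boils down to a per-key merge. Below is a minimal plain-Java sketch of just that merge semantics; the class and method names are hypothetical, and the actual Beam wiring (e.g. a periodically refreshed side input and View.asMap) is deliberately not shown here, since the right pattern for that is exactly what the thread is asking about.]

```java
import java.util.HashMap;
import java.util.Map;

public class SideInputMerge {

    // Merge a batch of updates (the keys that just arrived on topicB)
    // into the current enrichment map. Only the keys present in the
    // update batch are overwritten; all other keys keep their previous
    // enrichment values.
    public static Map<String, String> merge(Map<String, String> current,
                                            Map<String, String> updates) {
        Map<String, String> next = new HashMap<>(current);
        next.putAll(updates);
        return next;
    }

    public static void main(String[] args) {
        // Existing side-input state built from an earlier topicB batch.
        Map<String, String> view = new HashMap<>();
        view.put("k1", "v1");
        view.put("k2", "v2");

        // After a long quiet period, a new batch arrives for k2 only.
        Map<String, String> updates = new HashMap<>();
        updates.put("k2", "v2-new");

        Map<String, String> refreshed = merge(view, updates);
        System.out.println(refreshed.get("k1")); // k1 is untouched
        System.out.println(refreshed.get("k2")); // k2 is overwritten
    }
}
```

In a real pipeline this merge would happen when rebuilding the PCollectionView, so that a partial topicB batch does not wipe out enrichment data for keys it does not mention.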
