Hi, This is Mohil Khare from San Jose, California. I work in an early stage startup: Prosimo. We use Apache beam with gcp dataflow for all real time stats processing with Kafka and Pubsub as data source while elasticsearch and GCS as sinks.
I am trying to solve the following use with sideinputs. INPUT: 1. We have a continuous stream of data coming from pubsub topicA. This data can be put in KV Pcollection and each data item can be uniquely identified with certain key. 2. We have a very slow changing stream of data coming from pubsub topicB i.e. you can say that stream of data comes for few mins on topicB followed by no activity for a long time period. This stream of data can be again put in KV PCollection with same keys as above. NOTE: after long inactivity, it is possible that data comes for only certain keys. DESIRED OUTPUT/PROCESSING: 1. I want to use KV PCollection as sideinput to enrich data arriving in topicA. I think View.asMap can be a good choice for it. 2. After enriching data in topic A using sideinput data from topic B, write to GCS in a fixed window of 10 minutes 2. Want to continue using above PCollectionView as sideinput as long as no new data arrives in topicB. 3. Whenever new data arrives in topicB, want to update PCollectionView Map only for set of Keys that arrived in new stream. My question is what should be the best approach to tackle this use case? I will really appreciate if someone can suggest some good solution. Thanks and Regards Mohil Khare
