Using very slow changing stream of data (KV) as side input

Mohil Khare Tue, 28 Jan 2020 10:23:21 -0800

Hi,
This is Mohil Khare from San Jose, California. I work in an early stage
startup: Prosimo.
We use Apache beam with gcp dataflow for all real time stats processing
with Kafka and Pubsub as data source while elasticsearch and GCS as sinks.


I am trying to solve the following use with sideinputs.

INPUT:
1. We have a continuous stream of data coming from pubsub topicA. This data
can be put in KV Pcollection and each data item can be uniquely identified
with certain key.
2. We have a very slow changing stream of data coming from pubsub topicB
i.e. you can say that stream of data comes for few mins on topicB followed
by no activity for a long time period.   This stream of data can be again
put in KV PCollection with same keys as above. NOTE: after long inactivity,
it is possible that data comes for only certain keys.

DESIRED OUTPUT/PROCESSING:
1. I want to use KV PCollection as sideinput to enrich data arriving in
topicA. I think View.asMap can be a good choice for it.
2. After enriching data in topic A using sideinput data from topic B, write
to GCS in a fixed window of 10 minutes
2.  Want to continue using above PCollectionView as sideinput as long as no
new data arrives in topicB.
3. Whenever new data arrives in topicB, want to update PCollectionView Map
only for set of Keys that arrived in new stream.

My question is what should be the best approach to tackle this use case? I
will really appreciate if someone can suggest some good solution.

Thanks and Regards
Mohil Khare

Using very slow changing stream of data (KV) as side input

Reply via email to