damccorm opened a new issue, #20310: URL: https://github.com/apache/beam/issues/20310
I have a use case that I think would be a good addition to the pipeline patterns. Beam (Java SDK) reads two kinds of records from a data stream such as Kafka:

1. Records of type A, containing a key and the corresponding metadata.
2. Records of type B, containing the same key but no metadata.

Beam then needs to fill in the metadata for records of type B by looking it up with the keys received in records of type A.

The idea is to save the metadata (or rather, state) for keys received in type A records, and then do a lookup when type B records arrive. Beam's "@State" construct can be used here. The problem is that we don't know when keys should expire. I don't think keeping the state in a global window is a good idea, since there could be many keys (possibly millions over time) to hold in state.

One possible solution, as suggested by Reza Ardeshir Rokni: maintain the state in a large fixed window (1 day or so), so that GC can happen at the window boundary. After the window expires, save the metadata values in an external DB such as BigQuery. If a record with the same key arrives in a new window and needs this metadata, fetch the metadata for that key from the external DB and save it in the new window's state again.

Imported from Jira [BEAM-10019](https://issues.apache.org/jira/browse/BEAM-10019). Original Jira may contain additional context. Reported by: mohilkhare.
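
The fixed-window idea described above can be modeled in a few lines of plain Java. This is only a sketch of the control flow, not Beam code: it uses no Beam APIs, and the class `WindowedMetadataEnricher`, its method names, and the in-memory `externalStore` map (standing in for an external DB such as BigQuery) are all hypothetical. It also assumes records arrive in timestamp order, which a real streaming pipeline cannot rely on.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Plain-Java model of "state per fixed window, external store as fallback".
 * Hypothetical names throughout; this is not the Beam state API.
 */
final class WindowedMetadataEnricher {
    private final long windowSizeMillis;
    // Stand-in for per-key state scoped to the current fixed window.
    private final Map<String, String> windowState = new HashMap<>();
    // Stand-in for the external DB (assumption: a simple key/value lookup).
    private final Map<String, String> externalStore = new HashMap<>();
    private long currentWindowStart = Long.MIN_VALUE;
    private int externalLookups = 0;

    WindowedMetadataEnricher(long windowSizeMillis) {
        if (windowSizeMillis <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowSizeMillis = windowSizeMillis;
    }

    /** Type A record: remember the metadata for this key in the current window. */
    void onTypeA(String key, String metadata, long timestampMillis) {
        advanceWindow(timestampMillis);
        windowState.put(key, metadata);
    }

    /** Type B record: returns the metadata for the key, or null if it is unknown. */
    String onTypeB(String key, long timestampMillis) {
        advanceWindow(timestampMillis);
        String metadata = windowState.get(key);
        if (metadata == null) {
            // Miss in window state: fall back to the external store.
            externalLookups++;
            metadata = externalStore.get(key);
            if (metadata != null) {
                // Re-populate window state so later lookups stay local.
                windowState.put(key, metadata);
            }
        }
        return metadata;
    }

    /** Number of times the external store was consulted. */
    int externalLookups() {
        return externalLookups;
    }

    // When a record belongs to a later window, persist the old window's
    // state to the external store and clear it (the "GC" step).
    private void advanceWindow(long timestampMillis) {
        long windowStart =
            Math.floorDiv(timestampMillis, windowSizeMillis) * windowSizeMillis;
        if (windowStart > currentWindowStart) {
            externalStore.putAll(windowState);
            windowState.clear();
            currentWindowStart = windowStart;
        }
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        WindowedMetadataEnricher enricher = new WindowedMetadataEnricher(day);
        enricher.onTypeA("k1", "meta-1", 1_000L);
        // Same window: served from window state.
        System.out.println(enricher.onTypeB("k1", 2_000L));
        // Next window: state was flushed, so this goes to the external store.
        System.out.println(enricher.onTypeB("k1", day + 1_000L));
        System.out.println("external lookups: " + enricher.externalLookups());
    }
}
```

In an actual pipeline the window state would live in a stateful `DoFn` and the flush would presumably be driven by a timer at the end of the window. The sketch only illustrates the trade-off: local state stays bounded by one window's worth of keys, at the cost of one external lookup per key per window.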
