Hi,

Do you have an upper bound on how large the file will become? If it's small enough to fit into a side input, you may be able to make use of the slowly updating side input pattern: https://beam.apache.org/documentation/patterns/side-inputs/ (a minimal sketch of it follows just below).
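Roughly, that pattern looks like the following. This is an untested sketch: "pipeline" is your Pipeline object, it assumes the file contents fit in memory as a single String (swap in whatever parsed type you actually need), and loadFile() is a placeholder you would implement to read and parse the file from wherever it lives:

import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

PCollectionView<String> fileView =
    pipeline
        // Emit a tick every 5 minutes; each tick triggers a re-read.
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // loadFile() is a placeholder: read and parse the file here.
            c.output(loadFile());
          }
        }))
        // Re-fire the global window on each tick so the view is refreshed.
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(View.asSingleton());

// In the main branch, read it with c.sideInput(fileView):
// mainInput.apply(ParDo.of(new EnrichFn(fileView)).withSideInputs(fileView));

Every worker reading the side input then sees the latest fired value; how fresh that value is depends on the tick interval you pick.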
If not, then a Stateful DoFn would be a good choice, but note that a stateful DoFn keeps its state per key and window. Is there a natural key in the data that you can use? If yes, something like this pattern may be useful for your use case: streaming-joins-in-a-recommendation-system <https://cloud.google.com/blog/products/data-analytics/data-engineering-lessons-from-google-adsense-using-streaming-joins-in-a-recommendation-system>. A rough sketch of such a DoFn is at the end of this mail.

In terms of persisting the file, you may want to create a branch in the pipeline and, every time you update the file data, write the file out to an object store, which you can read from if the pipeline needs to be restarted or crashes. There is a sketch of that branch below the quoted message as well.

Cheers

Reza

On Mon, 16 Nov 2020 at 04:48, Artur Khanin <[email protected]> wrote:

> Hi all,
>
> I am designing a Dataflow pipeline in Java that has to:
>
> - Read a file (it may be pretty large) during initialization and then
>   store it in some sort of shared memory
> - Periodically update this file
> - Make this file available to read across all runner's instances
> - Persist this file in cases of restarts/crashes/scale-up/scale-down
>
> I found some information about stateful processing in Beam using Stateful
> DoFn <https://beam.apache.org/blog/stateful-processing/>. Is it an
> appropriate way to handle such functionality, or is there a better
> approach for it?
>
> Any help or information is very appreciated!
>
> Thanks,
> Artur Khanin
> Akvelon, Inc.
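First, the stateful DoFn. A very rough sketch, assuming your elements arrive as KV<String, String> keyed by that natural key; the class name, element types, and the "initial" seed value are all placeholders:

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class EnrichWithFileState extends DoFn<KV<String, String>, String> {

  // State is scoped per key and window: each key sees its own cell.
  @StateId("fileData")
  private final StateSpec<ValueState<String>> fileDataSpec = StateSpecs.value();

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("fileData") ValueState<String> fileData) {
    String data = fileData.read();
    if (data == null) {
      // First element for this key: seed the state, e.g. from the file
      // or from a side input. "initial" is just a placeholder.
      data = "initial";
      fileData.write(data);
    }
    c.output(c.element().getValue() + " / " + data);
  }
}

Keep in mind this state lives with the runner: if the pipeline is torn down rather than updated or drained, the state is gone, which is exactly where the object store branch comes in.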
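Second, the persistence branch. Something along these lines would write each updated version of the file data out as text; the bucket path, windowing, and shard count are assumptions you would adjust:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> updatedFileData = ...; // the branch carrying updates

updatedFileData.apply(
    TextIO.write()
        .to("gs://my-bucket/file-data/latest")  // placeholder path
        // Required for unbounded input: windowed writes + explicit shards.
        .withWindowedWrites()
        .withNumShards(1));

On a restart, your initialization step can then read the newest object back before processing resumes.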
