Hi,

Do you have an upper bound on how large the file will become? If it's small enough to fit into a side input, you may be able to make use of the slowly updating side input pattern: https://beam.apache.org/documentation/patterns/side-inputs/ (a minimal sketch of it follows just below).
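Roughly, that pattern looks like the following. This is an untested sketch: "pipeline" is your Pipeline object, it assumes the file contents fit in memory as a single String (swap in whatever parsed type you actually need), and loadFile() is a placeholder you would implement to read and parse the file from wherever it lives:

import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

PCollectionView<String> fileView =
    pipeline
        // Emit a tick every 5 minutes; each tick triggers a re-read.
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, String>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // loadFile() is a placeholder: read and parse the file here.
            c.output(loadFile());
          }
        }))
        // Re-fire the global window on each tick so the view is refreshed.
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(View.asSingleton());

// In the main branch, read it with c.sideInput(fileView):
// mainInput.apply(ParDo.of(new EnrichFn(fileView)).withSideInputs(fileView));

Every worker reading the side input then sees the latest fired value; how fresh that value is depends on the tick interval you pick.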
If not, then a Stateful DoFn would be a good choice, but note that a stateful DoFn keeps its state per key and window. Is there a natural key in the data that you can use? If yes, something like this pattern may be useful for your use case: streaming-joins-in-a-recommendation-system <https://cloud.google.com/blog/products/data-analytics/data-engineering-lessons-from-google-adsense-using-streaming-joins-in-a-recommendation-system>. A rough sketch of such a DoFn is at the end of this mail.

In terms of persisting the file, you may want to create a branch in the pipeline and, every time you update the file data, write the file out to an object store, which you can read from if the pipeline needs to be restarted or crashes. There is a sketch of that branch below the quoted message as well.

Cheers

Reza

On Mon, 16 Nov 2020 at 04:48, Artur Khanin <[email protected]> wrote:

> Hi all,
>
> I am designing a Dataflow pipeline in Java that has to:
>
> - Read a file (it may be pretty large) during initialization and then
>   store it in some sort of shared memory
> - Periodically update this file
> - Make this file available to read across all runner's instances
> - Persist this file in cases of restarts/crashes/scale-up/scale-down
>
> I found some information about stateful processing in Beam using Stateful
> DoFn <https://beam.apache.org/blog/stateful-processing/>. Is it an
> appropriate way to handle such functionality, or is there a better
> approach for it?
>
> Any help or information is very appreciated!
>
> Thanks,
> Artur Khanin
> Akvelon, Inc.
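First, the stateful DoFn. A very rough sketch, assuming your elements arrive as KV<String, String> keyed by that natural key; the class name, element types, and the "initial" seed value are all placeholders:

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class EnrichWithFileState extends DoFn<KV<String, String>, String> {

  // State is scoped per key and window: each key sees its own cell.
  @StateId("fileData")
  private final StateSpec<ValueState<String>> fileDataSpec = StateSpecs.value();

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("fileData") ValueState<String> fileData) {
    String data = fileData.read();
    if (data == null) {
      // First element for this key: seed the state, e.g. from the file
      // or from a side input. "initial" is just a placeholder.
      data = "initial";
      fileData.write(data);
    }
    c.output(c.element().getValue() + " / " + data);
  }
}

Keep in mind this state lives with the runner: if the pipeline is torn down rather than updated or drained, the state is gone, which is exactly where the object store branch comes in.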
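Second, the persistence branch. Something along these lines would write each updated version of the file data out as text; the bucket path, windowing, and shard count are assumptions you would adjust:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> updatedFileData = ...; // the branch carrying updates

updatedFileData.apply(
    TextIO.write()
        .to("gs://my-bucket/file-data/latest")  // placeholder path
        // Required for unbounded input: windowed writes + explicit shards.
        .withWindowedWrites()
        .withNumShards(1));

On a restart, your initialization step can then read the newest object back before processing resumes.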
