In the case of Dataflow, state is backed by a distributed storage system,
and this storage is separate from the worker nodes, so crashing worker
nodes will not cause data loss.

At present, though, this storage is tied to a single data center.
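
To illustrate the "durably committed before ack" idea the linked answers describe, here is a toy sketch (not Dataflow's actual implementation — the names DurableStore and Worker are invented for illustration, and a real runner persists to a replicated distributed store rather than an in-process dict):

```python
# Toy model of commit-before-ack: an element is durably committed
# before the source is acked, so losing a worker cannot lose data.

class DurableStore:
    """Stands in for storage that survives worker crashes."""
    def __init__(self):
        self._data = {}

    def commit(self, key, value):
        # In a real runner this would be a replicated write.
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)


class Worker:
    def __init__(self, store):
        self.store = store
        self.local_state = {}  # lost if this worker crashes

    def process(self, key, value):
        self.local_state[key] = value
        # Durably commit BEFORE acking the source, so a crash
        # after the ack cannot lose the element.
        self.store.commit(key, value)
        return "ack"


store = DurableStore()
w1 = Worker(store)
w1.process("count", 42)

# Simulate w1 crashing; a replacement worker recovers from the store.
w2 = Worker(store)
assert w2.store.read("count") == 42
```

The key ordering is that the source is only acked after the durable commit succeeds; anything held solely in a worker's local state is treated as recoverable from the store.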

Reuven

On Sat, Apr 24, 2021 at 9:19 AM Evan Galpin <[email protected]> wrote:

> Hi all!
>
> First off, I apologize for potentially dredging up a topic which has been
> asked a number of times before. However, I’m looking for slightly more
> detail than I have seen in previous answers:
>
> I’ve seen in a number of StackOverflow answers[1][2][3] mention of the
> phrase “durably committed” in response to questions on the topic of
> streaming pipelines reading from Unbounded sources like PubSub and Kafka.
>
> I’m curious to know more about the cases where “durably committed” data is
> materialized or, in the case of Dataflow, saved in “Dataflow internal
> storage” such as when mutating state or running GBK.
>
> What durability/redundancy guarantees are there in these cases? Is
> “Dataflow internal storage” backed by something like Google Cloud Storage?
> If a pipeline has a single worker node with materialized data in the
> pipeline which has not yet been written to a Sink, what happens if that
> singular worker were to crash and vanish? Can data loss occur like this?
>
> Thanks!
> Evan
>
> [1]
> https://stackoverflow.com/a/66338947/6432284
> [2]
> https://stackoverflow.com/a/46750189/6432284
> [3]
> https://stackoverflow.com/a/37309304/6432284
>
