In the case of Dataflow, streaming state is backed by a distributed storage system that is separate from the worker nodes, so a crashing worker will not cause data loss.
At the present time, though, the storage is tied to a single data center.

Reuven

On Sat, Apr 24, 2021 at 9:19 AM Evan Galpin <[email protected]> wrote:

> Hi all!
>
> First off, I apologize for potentially dredging up a topic which has been
> asked a number of times before. I’m looking for slightly more/different
> info than I have seen before however:
>
> I’ve seen in a number of StackOverflow answers [1][2][3] mention of the
> phrase “durably committed” in response to questions on the topic of
> streaming pipelines reading from unbounded sources like PubSub and Kafka.
>
> I’m curious to know more about the cases where “durably committed” data is
> materialized or, in the case of Dataflow, saved in “Dataflow internal
> storage”, such as when mutating state or running GBK.
>
> What durability/redundancy guarantees are there in these cases? Is
> “Dataflow internal storage” backed by something like Google Cloud Storage?
> If a pipeline has a single worker node with materialized data in the
> pipeline which has not yet been written to a sink, what happens if that
> single worker were to crash and vanish? Can data loss occur like this?
>
> Thanks!
> Evan
>
> [1] https://stackoverflow.com/a/66338947/6432284
> [2] https://stackoverflow.com/a/46750189/6432284
> [3] https://stackoverflow.com/a/37309304/6432284
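The pattern Reuven describes, where state is committed to storage that lives outside the worker before the source element is acknowledged, can be sketched in a few lines of plain Python. This is only an illustrative toy (a local fsync'd JSON file standing in for Dataflow's internal storage, and the `DurableState`/`process` names are invented here); Dataflow's actual store is a distributed, replicated service, not a file on the worker.

```python
import json
import os
import tempfile

class DurableState:
    """Toy stand-in for an external state store: commit writes and
    fsyncs to a temp file, then atomically renames it, so a crash
    never leaves a half-written state file and a replacement worker
    can recover the last committed state."""

    def __init__(self, path):
        self.path = path

    def commit(self, state):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # atomic on POSIX and Windows

    def recover(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

def process(message, store, ack):
    # Read state, mutate it, durably commit it, and only THEN ack
    # the source -- the ordering that makes replay safe: if the
    # worker dies before the ack, the unacked message is redelivered.
    state = store.recover()
    state["count"] = state.get("count", 0) + message
    store.commit(state)  # durably committed first...
    ack()                # ...then the source element is acknowledged

# Simulate one element, then pretend the worker vanished: a fresh
# recover() still sees the committed state.
path = os.path.join(tempfile.mkdtemp(), "state.json")
store = DurableState(path)
acked = []
process(5, store, lambda: acked.append(True))
print(store.recover())  # {'count': 5}
```

Because the ack happens only after the commit, a worker that crashes mid-element simply causes the unacknowledged message to be redelivered and reprocessed against the recovered state, which is the behavior the "durably committed" phrasing in the linked answers is pointing at.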
