Thanks Reuven! I assume for other runners the semantics might differ
significantly?

Do you happen to know if the Dataflow storage model is documented anywhere,
either in runner code or in documentation elsewhere?

Thanks again,
Evan

On Sat, Apr 24, 2021 at 12:35 Reuven Lax <[email protected]> wrote:

> In the case of Dataflow, storage is backed by a distributed storage
> system, and this storage is separate from the worker node. Crashing worker
> nodes will not cause data loss.
>
> At the present time though, the storage is tied to a single data center.
>
> Reuven
>
> On Sat, Apr 24, 2021 at 9:19 AM Evan Galpin <[email protected]> wrote:
>
>> Hi all!
>>
>> First off, I apologize for potentially dredging up a topic that has been
>> asked a number of times before; however, I’m looking for slightly more
>> (or different) information than previous answers have covered:
>>
>> I’ve seen in a number of StackOverflow answers[1][2][3] mention of the
>> phrase “durably committed” in response to questions on the topic of
>> streaming pipelines reading from Unbounded sources like PubSub and Kafka.
>>
>> I’m curious to know more about the cases where data is “durably
>> committed” (or, in the case of Dataflow, saved to “Dataflow internal
>> storage”), such as when mutating state or running a GBK.
>>
>> What durability/redundancy guarantees are there in these cases? Is
>> “Dataflow internal storage” backed by something like Google Cloud Storage?
>> If a pipeline has a single worker node holding materialized data that
>> has not yet been written to a sink, what happens if that single worker
>> were to crash and vanish? Can data loss occur this way?
>>
>> Thanks!
>> Evan
>>
>> [1]
>> https://stackoverflow.com/a/66338947/6432284
>> [2]
>> https://stackoverflow.com/a/46750189/6432284
>> [3]
>> https://stackoverflow.com/a/37309304/6432284
>>
>
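To illustrate the model Reuven describes, state committed to storage that is separate from the worker and survives a worker crash, here is a toy sketch. This is not Dataflow's actual implementation; the store, keys, and worker logic are all hypothetical, and it only shows the commit-before-ack pattern in miniature:

```python
# Toy model of the durability pattern: a worker commits its grouped
# state to a store that lives outside the worker process, and only
# after the commit succeeds would it ack the source. A replacement
# worker resumes from the last committed state after a crash.

class DistributedStore:
    """Stands in for runner-managed storage that outlives any worker."""
    def __init__(self):
        self._committed = {}

    def commit(self, key, value):
        self._committed[key] = value

    def read(self, key):
        return self._committed.get(key)


class Worker:
    def __init__(self, store):
        self.store = store
        # Resume from whatever was durably committed before this worker
        # started (e.g. after a predecessor crashed).
        self.counts = dict(store.read("counts") or {})

    def process(self, element):
        # Update local state, then commit it durably before the source
        # would be acked.
        self.counts[element] = self.counts.get(element, 0) + 1
        self.store.commit("counts", dict(self.counts))


store = DistributedStore()

w1 = Worker(store)
w1.process("a")
w1.process("a")
del w1  # simulate the worker crashing and vanishing

w2 = Worker(store)  # replacement worker resumes from committed state
w2.process("a")
print(w2.counts["a"])  # 3: no data lost despite the crash
```

The key point, per Reuven's answer, is that the store is not on the worker node, so losing the worker loses only uncommitted in-flight work, which the runner can replay from the source.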
