Hi all! First off, I apologize for potentially dredging up a topic that has been asked a number of times before; however, I'm looking for slightly more (and different) information than I've seen in previous answers:
I've seen a number of StackOverflow answers [1][2][3] use the phrase "durably committed" in response to questions about streaming pipelines reading from unbounded sources like Pub/Sub and Kafka. I'm curious to know more about the cases where "durably committed" data is materialized or, in the case of Dataflow, saved to "Dataflow internal storage", such as when mutating state or running a GroupByKey (GBK). What durability/redundancy guarantees exist in these cases? Is "Dataflow internal storage" backed by something like Google Cloud Storage? And if a pipeline has a single worker holding materialized data that has not yet been written to a sink, what happens if that worker crashes and vanishes? Can data loss occur this way?

Thanks!
Evan

[1] https://stackoverflow.com/a/66338947/6432284
[2] https://stackoverflow.com/a/46750189/6432284
[3] https://stackoverflow.com/a/37309304/6432284
