This is a discussion that I don't think affects any immediate decisions, but it does inform how folks write unit tests, so I wanted to give it its own thread.
Ismael mentioned:

"I am not sure that unit tests are enough to test distribution issues because they are harder to simulate in particular if we add the fact that we can have too many moving pieces. For example, imagine that we run a Beam pipeline deployed via Spark on a YARN cluster (where some nodes can fail) that reads from Kafka (with some slow partition) and writes to Cassandra (with a partition that goes down). You see, this is a quite complex combination of pieces (and possible issues), but it is not a totally artificial scenario, in fact this is a common architecture, and this can (at least in theory) be simulated with a cluster manager, but I don’t see how can I easily reproduce this with a unit test."

I'd like to separate out two scenarios:

1. Testing for failures we know can occur
2. Testing for failures we don't realize can occur

For known failure scenarios (#1), we can definitely recreate them with unit tests, as long as we focus on the code under test and how those failures interact with it. In the case you describe, we can think through how the failures would surface in the IO code and the runner code, and write unit tests for those scenarios. That way we don't need to worry about the combinatorial explosion of Kafka failures * Spark failures * YARN cluster failures * Cassandra failures; we can focus on the boundaries between those pieces. That is: which of these pieces directly interact, and how can each surface failures to the others? We then test each of those individual failures against each particular component (and, where useful, combinations of failures within a single component).

For example: a CassandraIO test that ensures the IO behaves correctly if a particular worker running a BoundedReader/ParDo goes away. We don't care whether that happens because of a Spark failure or a YARN failure; we just know the reader's worker went away before committing its work. (There's a sketch of what such a test might look like at the end of this mail.)

However, I think you are getting at the value that chaos-monkey-style testing provides: surfacing failure scenarios we don't realize can occur (#2). In that case, I do agree that a full-stack chaos-monkey test can help. As you mentioned, that's a good thing to focus on down the line. I would especially call out that those tests can generate a lot of noise, and their failures are hard to investigate. I see them as valuable, but I would want to implement them only after we have proven that we have good tests for the failure scenarios we do know about. It has also proven useful to turn the failures found by chaos-monkey testing into concrete unit tests on the affected components.
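To make the CassandraIO example concrete, here is a minimal sketch of the shape such a test could take. FakeReader and FakeSink are hypothetical stand-ins I'm inventing purely for illustration; a real test would drive the IO's actual BoundedReader against a fake or embedded Cassandra service (Beam's SourceTestUtils offers helpers in a similar spirit). The idea is simply: read part of the input, drop the bundle without committing (the "worker went away"), re-run the read, and assert the committed output is still correct.

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import org.junit.Test;

public class ReaderDisappearsTest {

  /** Hypothetical reader over a fixed set of records, mimicking a BoundedReader. */
  static class FakeReader {
    private final List<String> records;
    private int position = 0;

    FakeReader(List<String> records) { this.records = records; }

    boolean advance() { return ++position <= records.size(); }

    String getCurrent() { return records.get(position - 1); }
  }

  /** Hypothetical sink that only keeps records from bundles that commit. */
  static class FakeSink {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void write(String record) { pending.add(record); }

    void commitBundle() { committed.addAll(pending); pending.clear(); }

    void abandonBundle() { pending.clear(); } // the worker died before committing

    List<String> committed() { return committed; }
  }

  @Test
  public void workerDiesBeforeCommit_retryProducesCorrectOutput() {
    List<String> input = Arrays.asList("a", "b", "c", "d");
    FakeSink sink = new FakeSink();

    // First attempt: the worker reads part of the input, then disappears
    // before the bundle commits. Nothing it wrote should become visible.
    FakeReader attempt1 = new FakeReader(input);
    for (int i = 0; i < 2 && attempt1.advance(); i++) {
      sink.write(attempt1.getCurrent());
    }
    sink.abandonBundle();

    // Retry: the runner re-executes the bundle from the beginning.
    FakeReader attempt2 = new FakeReader(input);
    while (attempt2.advance()) {
      sink.write(attempt2.getCurrent());
    }
    sink.commitBundle();

    // The committed output must contain each record exactly once.
    assertEquals(new HashSet<>(input), new HashSet<>(sink.committed()));
    assertEquals(input.size(), sink.committed().size());
  }
}

The same skeleton works for the write side: abandon a write bundle mid-flight and assert that the retry doesn't produce output beyond what the IO's semantics promise.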
S