This is a discussion that I don't think affects any immediate decisions, but it does inform how folks write unit tests, so I wanted to give it its own thread.
Ismael mentioned:

"I am not sure that unit tests are enough to test distribution issues because they are harder to simulate in particular if we add the fact that we can have too many moving pieces. For example, imagine that we run a Beam pipeline deployed via Spark on a YARN cluster (where some nodes can fail) that reads from Kafka (with some slow partition) and writes to Cassandra (with a partition that goes down). You see, this is a quite complex combination of pieces (and possible issues), but it is not a totally artificial scenario, in fact this is a common architecture, and this can (at least in theory) be simulated with a cluster manager, but I don’t see how can I easily reproduce this with a unit test."

I'd like to separate out two scenarios:

1. Testing for failures we know can occur
2. Testing for failures we don't realize can occur

For known failure scenarios (#1), we can definitely recreate them with unit tests, as long as we focus on the code under test and how those failures interact with it. In the case you describe, we can think through how the failures would surface in the IO code and the runner code, and write unit tests for those scenarios. That way we don't need to worry about the combinatorial explosion of Kafka failures * Spark failures * YARN cluster failures * Cassandra failures; we can focus on the boundaries between those pieces. That is: which of these pieces directly interact, and how can each surface failures to the others? We then test each of those individual failures against each particular component (and, where useful, combinations of failures within a single component).

For example: a CassandraIO test that ensures the IO behaves correctly if a particular worker running a BoundedReader/ParDo goes away. We don't care whether that happens because of a Spark failure or a YARN failure; we just know the reader's worker went away before committing its work. (There's a sketch of what such a test might look like at the end of this mail.)

However, I think you are getting at the value that chaos-monkey-style testing provides: surfacing failure scenarios we don't realize can occur (#2). In that case, I do agree that a full-stack chaos-monkey test can help. As you mentioned, that's a good thing to focus on down the line. I would especially call out that those tests can generate a lot of noise, and their failures are hard to investigate. I see them as valuable, but I would want to implement them only after we have proven that we have good tests for the failure scenarios we do know about. It has also proven useful to turn the failures found by chaos-monkey testing into concrete unit tests on the affected components.
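To make the CassandraIO example concrete, here is a minimal sketch of the shape such a test could take. FakeReader and FakeSink are hypothetical stand-ins I'm inventing purely for illustration; a real test would drive the IO's actual BoundedReader against a fake or embedded Cassandra service (Beam's SourceTestUtils offers helpers in a similar spirit). The idea is simply: read part of the input, drop the bundle without committing (the "worker went away"), re-run the read, and assert the committed output is still correct.

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import org.junit.Test;

public class ReaderDisappearsTest {

  /** Hypothetical reader over a fixed set of records, mimicking a BoundedReader. */
  static class FakeReader {
    private final List<String> records;
    private int position = 0;

    FakeReader(List<String> records) { this.records = records; }

    boolean advance() { return ++position <= records.size(); }

    String getCurrent() { return records.get(position - 1); }
  }

  /** Hypothetical sink that only keeps records from bundles that commit. */
  static class FakeSink {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void write(String record) { pending.add(record); }

    void commitBundle() { committed.addAll(pending); pending.clear(); }

    void abandonBundle() { pending.clear(); } // the worker died before committing

    List<String> committed() { return committed; }
  }

  @Test
  public void workerDiesBeforeCommit_retryProducesCorrectOutput() {
    List<String> input = Arrays.asList("a", "b", "c", "d");
    FakeSink sink = new FakeSink();

    // First attempt: the worker reads part of the input, then disappears
    // before the bundle commits. Nothing it wrote should become visible.
    FakeReader attempt1 = new FakeReader(input);
    for (int i = 0; i < 2 && attempt1.advance(); i++) {
      sink.write(attempt1.getCurrent());
    }
    sink.abandonBundle();

    // Retry: the runner re-executes the bundle from the beginning.
    FakeReader attempt2 = new FakeReader(input);
    while (attempt2.advance()) {
      sink.write(attempt2.getCurrent());
    }
    sink.commitBundle();

    // The committed output must contain each record exactly once.
    assertEquals(new HashSet<>(input), new HashSet<>(sink.committed()));
    assertEquals(input.size(), sink.committed().size());
  }
}

The same skeleton works for the write side: abandon a write bundle mid-flight and assert that the retry doesn't produce output beyond what the IO's semantics promise.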
S