Agreed, this is about predictable vs. unpredictable errors. It also covers
the checkpoint case we discussed yesterday.

Thanks for bringing up this point.

Regards
JB

On Jan 19, 2017, at 12:07, Stephen Sisk <[email protected]> wrote:
>This is a discussion that I don't think affects any immediate decisions,
>but that does inform how folks are writing unit tests, so I wanted to give
>it its own thread.
>
>Ismael mentioned:
>"I am not sure that unit tests are enough to test distribution issues
>because they are harder to simulate in particular if we add the fact
>that
>we can have too many moving pieces. For example, imagine that we run a
>Beam
>pipeline deployed via Spark on a YARN cluster (where some nodes can
>fail)
>that reads from Kafka (with some slow partition) and writes to
>Cassandra
>(with a partition that goes down). You see, this is a quite complex
>combination of pieces (and possible issues), but it is not a totally
>artificial scenario, in fact this is a common architecture, and this
>can
>(at least in theory) be simulated with a cluster manager, but I don’t
>see
>how can I easily reproduce this with a unit test."
>
>I'd like to separate out two scenarios:
>1. Testing for failures we know can occur
>2. Testing for failures we don't realize can occur
>
>For known failure scenarios (#1), we can definitely recreate them with a
>unit test, as long as we focus on the code being tested and how those
>failures interact with it. In the case you describe, we can think through
>how the failures would surface in the IO code and runner code and write
>unit tests for that scenario. That way we don't need to worry about the
>combinatorial explosion of Kafka failures * Spark failures * YARN cluster
>failures * Cassandra failures - we can just focus on the boundaries
>between those pieces. That is, which of these pieces directly interact,
>and how can they surface failures to the other pieces? We then test each
>of those individual failures on each particular component (and, if useful,
>the combination of failures within a particular piece).
>
>For example: a CassandraIO test that ensures that if a particular worker
>running a BoundedReader/ParDo goes away, the IO performs correctly. We
>don't care whether that happens because of a Spark failure or a YARN
>failure - we just know the reader worker went away before committing work.
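>
>To make that concrete, here is a rough sketch of what such a test could
>look like (plain JUnit; FakeService, FakeReader and the test class name
>are hypothetical in-memory stand-ins I'm making up for illustration - the
>real CassandraIO reader classes and APIs will differ):
>
>  import static org.junit.Assert.assertEquals;
>
>  import java.util.ArrayList;
>  import java.util.Arrays;
>  import java.util.List;
>  import org.junit.Test;
>
>  public class ReaderWorkerFailureTest {
>
>    // Hypothetical in-memory stand-in for the external datastore.
>    static class FakeService {
>      final List<String> rows;
>      FakeService(List<String> rows) { this.rows = rows; }
>    }
>
>    // Hypothetical reader over the fake service, shaped like a BoundedReader.
>    static class FakeReader {
>      private final FakeService service;
>      private int index = -1;
>      FakeReader(FakeService service) { this.service = service; }
>      boolean advance() { return ++index < service.rows.size(); }
>      String getCurrent() { return service.rows.get(index); }
>      void close() { /* release connections; nothing read so far is committed */ }
>    }
>
>    @Test
>    public void retryAfterWorkerLossStillSeesAllRecords() {
>      FakeService service = new FakeService(Arrays.asList("a", "b", "c", "d"));
>
>      // First attempt: the worker reads part of the data and then goes away
>      // before committing any work (simulated by closing the reader early).
>      FakeReader firstAttempt = new FakeReader(service);
>      firstAttempt.advance();
>      firstAttempt.advance();
>      firstAttempt.close();
>
>      // The runner retries the bundle with a fresh reader; since nothing was
>      // committed, the retry must still observe every record.
>      List<String> retried = new ArrayList<>();
>      FakeReader retry = new FakeReader(service);
>      while (retry.advance()) {
>        retried.add(retry.getCurrent());
>      }
>      retry.close();
>
>      assertEquals(Arrays.asList("a", "b", "c", "d"), retried);
>    }
>  }
>
>The point is that the test only exercises the reader's restart behavior at
>the boundary - it doesn't care whether Spark or YARN caused the worker to
>vanish.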
>
>However, I think you are getting at the value that chaos-monkey style
>testing provides: surfacing failure scenarios that we don't realize can
>occur (#2). In that case, I do agree that having a full-stack chaos-monkey
>test can help. As you mentioned, that's a good thing to focus on down the
>line. I would especially call out that those tests can make a lot of noise
>and their failures are hard to investigate. I see them as valuable, but I
>would want to consider implementing them after we have proven that we have
>good tests for the failure scenarios we do know about. It has also proven
>useful to turn the failures found by chaos-monkey testing into concrete
>unit tests on the components affected.
>
>S
