Agree. It comes down to predictable vs. unpredictable errors. It also covers the checkpoint case we discussed yesterday.
Thanks for bringing this point.

Regards
JB

On Jan 19, 2017, at 12:07, Stephen Sisk <[email protected]> wrote:
>This is a discussion that I don't think affects any immediate decisions,
>but that does inform how folks are writing unit tests, so I wanted to
>give it its own thread.
>
>Ismael mentioned:
>"I am not sure that unit tests are enough to test distribution issues
>because they are harder to simulate, in particular if we add the fact
>that we can have too many moving pieces. For example, imagine that we
>run a Beam pipeline deployed via Spark on a YARN cluster (where some
>nodes can fail) that reads from Kafka (with some slow partition) and
>writes to Cassandra (with a partition that goes down). You see, this is
>a quite complex combination of pieces (and possible issues), but it is
>not a totally artificial scenario; in fact this is a common
>architecture, and it can (at least in theory) be simulated with a
>cluster manager, but I don't see how I can easily reproduce this with a
>unit test."
>
>I'd like to separate out two scenarios:
>1. Testing for failures we know can occur
>2. Testing for failures we don't realize can occur
>
>For known failure scenarios (#1), we can definitely recreate them with
>a unit test - as long as we focus on the code being tested and how
>those failures interact with it. In the case you describe, we can think
>through how the failures would surface in the IO code and runner code
>and write unit tests for that scenario. That way we don't need to worry
>about the combinatorial explosion of Kafka failures * Spark failures *
>YARN cluster failures * Cassandra failures - we can just focus on the
>boundaries between those pieces. That is, which of these pieces
>directly interact, and how can they surface failures to the other
>pieces? We then test each of those individual failures on each
>particular component (and, if useful, the combination of failures
>within a particular piece).
>
>For example: a CassandraIO test that ensures that if a particular
>worker running a BoundedReader/ParDo goes away, the IO performs
>correctly. We don't care whether that happens because of a Spark
>failure or a YARN failure - we just know the reader worker went away
>before committing work.
>
>However, I think you are getting at the value that chaos-monkey style
>testing provides: showing failure scenarios that we don't realize can
>occur (#2) - in that case, I do agree that having a full-stack
>chaos-monkey test can help. As you mentioned, that's a good thing to
>focus on down the line. I would especially call out that those tests
>can make a lot of noise and the failures are hard to investigate. I see
>them as valuable, but I would want to consider implementing them after
>we have proven that we have good tests for the failure scenarios we do
>know about. It has also proven useful to turn the failures found by
>chaos-monkey testing into concrete unit tests on the components
>affected.
>
>S
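To make the "reader worker goes away before committing work" scenario concrete, here is a minimal sketch of the kind of unit test described above. The FakeSplit/FakeReader classes are hypothetical in-memory stand-ins, not Beam's actual CassandraIO or BoundedSource internals; the only point is the boundary behaviour being asserted - re-reading a split after a worker disappears must still yield the complete data.

```java
import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class ReaderWorkerDisappearsTest {

  /** Hypothetical in-memory stand-in for one token-range split of the source. */
  static class FakeSplit {
    final List<String> rows;
    FakeSplit(List<String> rows) { this.rows = rows; }
    /** Each call simulates a fresh worker opening its own reader over the split. */
    FakeReader newReader() { return new FakeReader(rows); }
  }

  /** Hypothetical reader; nothing counts as committed until the bundle is finalized. */
  static class FakeReader {
    private final List<String> rows;
    private int next = 0;
    FakeReader(List<String> rows) { this.rows = rows; }
    boolean hasNext() { return next < rows.size(); }
    String read() { return rows.get(next++); }
  }

  @Test
  public void rereadAfterWorkerLossProducesCompleteData() {
    FakeSplit split = new FakeSplit(Arrays.asList("a", "b", "c", "d"));

    // First worker reads part of the split, then "goes away" before committing:
    // we simply abandon its reader without recording any of its output.
    FakeReader lostWorker = split.newReader();
    lostWorker.read();
    lostWorker.read();

    // The runner retries the bundle on a new worker; the IO must produce the
    // full split again, regardless of whether the loss came from Spark or YARN.
    List<String> retried = new ArrayList<>();
    FakeReader retryWorker = split.newReader();
    while (retryWorker.hasNext()) {
      retried.add(retryWorker.read());
    }
    assertEquals(Arrays.asList("a", "b", "c", "d"), retried);
  }
}
```

Note how the test never mentions Spark, YARN, Kafka, or Cassandra failures directly; it only exercises the contract at the boundary (a reader abandoned before commit), which is the component-level focus the discussion above argues for.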
