Hi guys, I've been working on running Kudu with CORDS <https://github.com/aganesan4/CORDS>, a research project that injects disk-level failures into various distributed systems (Kafka, ZooKeeper, etc.).
CORDS works by mounting the system under test on a filesystem that can inject errors and corruptions into a single block. While this may be useful, I don't think we need it for testing disk failures, for a few reasons:

- The disk failure injection in some of my (in-review) tests is already quite similar to the injection here, but at the file level instead of the block level.
- Their tests amount to writing up a set of test scenarios and examining the outputs (via grep, sed, etc.) when running with faults injected. We should be able to achieve the same with our own integration tests.
- The end goal of running with CORDS is to see what happens when I/O fails on various disk blocks; we should already have (now and in the future) a solid understanding of what happens when these failures occur, and we should be testing for them ourselves.
- This may be nitpick-y, but their injection code is poorly commented and could be a hassle to maintain if we were to, say, run it as a recurring job. It isn't particularly hard to read through, but debugging errors on their end could be annoying.
- Another thing that struck me as odd is that not all of their code is available for other systems (e.g. they posted their code only for ZooKeeper, not for Kafka, MongoDB, Redis, etc.).

That isn't to say there's no value in running CORDS; it would be a nice check that we're doing the right thing. But my main takeaway is that we can do much of this testing ourselves (and perhaps even improve our testing/debugging infrastructure for more end-to-end tests along the way).

I'd be interested in hearing thoughts. In running verification tasks like Jepsen, are we getting coverage that we wouldn't get from our own integration tests? And how is our test coverage for disk corruption?

-- Andrew Wong
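(For readers unfamiliar with the file-level approach mentioned above, here is a minimal sketch of the idea: wrap file I/O so that reads on a chosen path can be made to fail with EIO, simulating a failed disk. This is a hypothetical illustration, not Kudu's or CORDS's actual test code; `FaultyFile` and `inject_eio` are made-up names.)

```python
import errno
import os
import tempfile

class FaultyFile:
    """Wraps a file and raises EIO on read when fault injection is armed.

    Hypothetical sketch of file-level fault injection; a block-level tool
    like CORDS would instead corrupt or fail individual disk blocks
    underneath the filesystem.
    """
    def __init__(self, path, mode="rb", inject_eio=False):
        self._f = open(path, mode)
        self._inject_eio = inject_eio  # arm/disarm the injected failure

    def read(self, size=-1):
        if self._inject_eio:
            # Simulate a disk I/O error surfacing at the file level.
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        return self._f.read(size)

    def close(self):
        self._f.close()

# Demo: a healthy read succeeds; an injected read fails with EIO.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"block data")
    path = tmp.name

f = FaultyFile(path)
assert f.read() == b"block data"
f.close()

f = FaultyFile(path, inject_eio=True)
try:
    f.read()
    raised = False
except OSError as e:
    raised = (e.errno == errno.EIO)
f.close()
os.remove(path)
```

A test scenario would then exercise the system's recovery path (replication, error reporting, etc.) while the fault is armed, rather than grepping tool output after the fact.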
