Hi guys, I've been working on running Kudu with CORDS <https://github.com/aganesan4/CORDS>, a research project that injects disk-level failures into various distributed systems (Kafka, ZooKeeper, etc.).
CORDS works by mounting the system under test on a filesystem that can inject errors and corruptions into a single block. While this may be useful, I don't think we need it for testing disk failures, for a few reasons:

- The disk failure injection in some of my (in-review) tests is already quite similar to the injection here, but at the file level instead of the block level.
- Their tests amount to writing up a set of test scenarios and examining the outputs (via grep, sed, etc.) when running with faults injected. We should be able to achieve the same with our own integration tests.
- The end goal of running with CORDS is to see what happens when I/O fails on various disk blocks; we should already have (now and in the future) a solid understanding of what happens when these failures occur, and we should be testing for them ourselves.
- This may be nitpick-y, but their injection code is poorly commented and could be a hassle to maintain if we were to, say, run it as a recurring job. It isn't particularly hard to read through, but debugging errors on their end could be annoying.
- Another thing that struck me as odd is that not all of their code is available for other systems (e.g. they posted their code only for ZooKeeper, not for Kafka, MongoDB, Redis, etc.).

That isn't to say there's no value in running CORDS; it would be a nice check that we're doing the right thing. But my main takeaway is that we can do much of this testing ourselves (and perhaps even improve our testing/debugging infrastructure for more end-to-end tests along the way).

I'd be interested in hearing thoughts. In running verification tasks like Jepsen, are we getting coverage that we wouldn't get from our own integration tests? And how is our test coverage for disk corruption?

-- Andrew Wong
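(For readers unfamiliar with the file-level approach mentioned above, here is a minimal sketch of the idea: wrap file I/O so that reads on a chosen path can be made to fail with EIO, simulating a failed disk. This is a hypothetical illustration, not Kudu's or CORDS's actual test code; `FaultyFile` and `inject_eio` are made-up names.)

```python
import errno
import os
import tempfile

class FaultyFile:
    """Wraps a file and raises EIO on read when fault injection is armed.

    Hypothetical sketch of file-level fault injection; a block-level tool
    like CORDS would instead corrupt or fail individual disk blocks
    underneath the filesystem.
    """
    def __init__(self, path, mode="rb", inject_eio=False):
        self._f = open(path, mode)
        self._inject_eio = inject_eio  # arm/disarm the injected failure

    def read(self, size=-1):
        if self._inject_eio:
            # Simulate a disk I/O error surfacing at the file level.
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        return self._f.read(size)

    def close(self):
        self._f.close()

# Demo: a healthy read succeeds; an injected read fails with EIO.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"block data")
    path = tmp.name

f = FaultyFile(path)
assert f.read() == b"block data"
f.close()

f = FaultyFile(path, inject_eio=True)
try:
    f.read()
    raised = False
except OSError as e:
    raised = (e.errno == errno.EIO)
f.close()
os.remove(path)
```

A test scenario would then exercise the system's recovery path (replication, error reporting, etc.) while the fault is armed, rather than grepping tool output after the fact.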
