Hey Andrew,

Sorry for the slow response here. Catching up after being out of town.

I think it's reasonable to cut the CORDS work short if it doesn't seem to
have a good return on invested time versus doing our own injection at the
app (env_posix) layer.
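
For concreteness, here's roughly what I have in mind at that layer. This
is just a sketch, and the class and method names are made up rather than
our actual Env interface; the idea is a thin wrapper over the POSIX write
path that a test can tell to fail with EIO or to short-write:

// Illustrative only -- a made-up wrapper, not Kudu's actual Env interface.
#include <sys/types.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>

class FaultInjectingWriter {
 public:
  // Fail every subsequent write with EIO, as if the disk returned an error.
  void InjectEIO(bool enabled) { inject_eio_ = enabled; }

  // Truncate each write to the given fraction of its requested length,
  // roughly simulating a torn/partial write.
  void InjectShortWrites(double fraction) { short_write_fraction_ = fraction; }

  ssize_t Write(int fd, const void* buf, size_t count, off_t offset) const {
    if (inject_eio_) {
      errno = EIO;
      return -1;
    }
    size_t n = count;
    if (short_write_fraction_ > 0.0 && short_write_fraction_ < 1.0) {
      n = static_cast<size_t>(count * short_write_fraction_);
    }
    return pwrite(fd, buf, n, offset);
  }

 private:
  bool inject_eio_ = false;
  double short_write_fraction_ = 0.0;
};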

As for whether this gives us the _same_ coverage as CORDS, I think the main
advantage of CORDS is that it may expose cases where a particular file
system responds to corruption in an unexpected way. In particular, scenarios
such as an I/O error followed by a machine restart might result in "strange"
corruptions such as truncated files, partially written extents, etc., which
we might not have specific injections for. For example, arbitrary pages of
data at the end of a WAL might have been partially written before a crash,
and after a restart we may have trouble recovering from that state.
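
To make that scenario concrete, we can approximate that kind of state at
the file level too, e.g. by truncating a log file mid-page between a kill
and a restart. Rough sketch, plain POSIX; the file name and page size
below are just placeholders, nothing Kudu-specific:

// Illustrative only -- crudely fake a torn write by chopping the tail of a
// log file mid-page before restarting the process.
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int main() {
  const char* path = "wal-000001";  // hypothetical log segment
  const off_t kPageSize = 4096;     // assumed page size, for illustration

  struct stat st;
  if (stat(path, &st) != 0) {
    perror("stat");
    return 1;
  }

  // Find the start of the file's final (possibly partial) page, then keep
  // only half of whatever was written into it.
  off_t last_page_start = (st.st_size / kPageSize) * kPageSize;
  if (last_page_start == st.st_size && last_page_start >= kPageSize) {
    last_page_start -= kPageSize;  // file ends exactly on a page boundary
  }
  off_t torn_size = last_page_start + (st.st_size - last_page_start) / 2;

  if (truncate(path, torn_size) != 0) {
    perror("truncate");
    return 1;
  }
  printf("truncated %s from %lld to %lld bytes\n",
         path, (long long)st.st_size, (long long)torn_size);
  return 0;
}

Obviously a real filesystem can leave messier states than a clean
truncation, and that's the part we'd still be guessing at.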

So, I think there might be value in running end-to-end system tests under
something like CORDS in the future, but we'll probably get most of our
mileage (80/20 rule) from the kind of injection you're doing at the app
layer. If we learn about more strange FS-level behaviors, we can always
simulate them at the app layer as well.

-Todd

On Tue, Jul 11, 2017 at 12:20 PM, Andrew Wong <aw...@cloudera.com> wrote:

> Hi guys,
>
> I've been working on running Kudu with CORDS
> <https://github.com/aganesan4/CORDS>, a research project that was used to
> inject disk-level failures in various distributed systems (Kafka,
> ZooKeeper, etc.).
>
> They approached this by running these systems on top of a filesystem that
> could inject various errors and corruptions into a single block. While
> that may be useful, I don't think we need it for testing disk failures,
> for a few reasons:
>
>    - The disk failure injection in some of my (in-review) tests is already
>    quite similar to the injection here, but at a file level instead of a
>    block level
>    - Their tests amount to writing up a bunch of test scenarios and
>    examining the outputs (via grep, sed, etc.) when running with faults
>    injected. We should be able to achieve this ourselves with integration
>    tests.
>    - The end goal of running with CORDS is to see what happens when
>    various disk blocks fail I/O; we should already have (and continue to
>    build) a solid understanding of how Kudu behaves when these failures
>    happen, and we should be testing for them ourselves
>    - Maybe nitpicky, but their injection code is poorly commented and may
>    be a hassle to maintain if we were to, say, run this as a recurring
>    job. It's not particularly hard to read through, but it might be
>    annoying to debug errors on their end.
>    - Another thing I found a bit odd: not all of their code is available
>    for the other systems they tested (e.g. they only posted their code
>    for ZooKeeper, not for Kafka, MongoDB, Redis, etc.)
>
> That isn't to say there's no value in running CORDS. It would be a nice
> check that we're doing the right thing. However, my main takeaway is that
> we can do a lot of this testing ourselves (and perhaps even improve our
> testing/debugging infrastructure for more end-to-end tests).
>
> I'd be interested in hearing thoughts. In running verification tools like
> Jepsen, are we getting coverage that we wouldn't get in our own
> integration tests? And how is our test coverage for disk corruption?
>
> --
> Andrew Wong
>



-- 
Todd Lipcon
Software Engineer, Cloudera
