[
https://issues.apache.org/jira/browse/KUDU-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088250#comment-16088250
]
Andrew Wong edited comment on KUDU-1960 at 7/14/17 10:58 PM:
-------------------------------------------------------------
I spent some time experimenting with CORDS. Its authors mount each system
under test on a FUSE filesystem (errfs) that can inject various errors and
corruptions into a single block. While this may be useful, I don't think we
need it for testing disk failures, for a few reasons:
* The disk failure injection in some of my (in-review) tests is already quite
similar to the injection here, but at the file level instead of the block level
* Their tests amount to writing up a set of test scenarios and examining the
outputs (via grep, sed, etc.) when running with faults injected. We should be
able to achieve this ourselves with integration tests
* The end goal of running with CORDS is to see what happens when I/O to various
disk blocks fails; we should already have (and keep maintaining) a solid
understanding of what happens in these cases, and we should be testing for
them ourselves
* This may be nitpicky, but their injection code is poorly commented and may be
a hassle to maintain if we were to, say, run this as a recurring job. It's not
particularly hard to read through, but debugging errors that originate in
their code might be annoying.
* Another thing I found a bit odd is that not all of their code is available
for other systems (e.g. they only posted their code for ZooKeeper and not
Kafka, MongoDB, Redis, etc.)
That isn't to say there is no value in running CORDS or something similar. It
would be a nice check to make sure we're doing the right thing. However, my
main takeaway is that we can do a lot of this testing ourselves (and perhaps
even improve our testing/debugging infrastructure for more end-to-end tests).
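As a rough illustration of what file-level injection plus a scenario check
could look like, here is a minimal Python sketch. It is not taken from Kudu or
CORDS: FaultInjector, read_with_failover, run_scenario, and the replica file
names are all hypothetical, and the "failover" is just a stand-in for whatever
behavior a real integration test would assert on.

```python
import errno
import os
import tempfile


class FaultInjector:
    """Hypothetical file-level fault injector (not Kudu's or CORDS's API)."""

    def __init__(self):
        self._failing_paths = set()

    def fail_path(self, path):
        """Mark a file so that every later read of it fails with EIO."""
        self._failing_paths.add(os.path.abspath(path))

    def read(self, path):
        """Read a whole file, raising EIO if the file was marked as failed."""
        if os.path.abspath(path) in self._failing_paths:
            raise OSError(errno.EIO, "injected I/O error", path)
        with open(path, "rb") as f:
            return f.read()


def read_with_failover(injector, replicas):
    """Return data from the first readable replica, plus the failed paths."""
    failed = []
    for path in replicas:
        try:
            return injector.read(path), failed
        except OSError as e:
            if e.errno != errno.EIO:
                raise  # only injected-style I/O errors trigger failover
            failed.append(path)
    raise OSError(errno.EIO, "all replicas failed")


def run_scenario():
    """Write two replica files, fail the first, then read via the injector."""
    with tempfile.TemporaryDirectory() as d:
        replicas = [os.path.join(d, name) for name in ("replica0", "replica1")]
        for path in replicas:
            with open(path, "wb") as f:
                f.write(b"row-data")
        injector = FaultInjector()
        injector.fail_path(replicas[0])
        data, failed = read_with_failover(injector, replicas)
        return data, [os.path.basename(p) for p in failed]
```

The scenario asserts on return values directly instead of grepping output,
which is the main difference from the CORDS-style workflow described above.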
> Run CORDS or similar tests on Kudu
> ----------------------------------
>
> Key: KUDU-1960
> URL: https://issues.apache.org/jira/browse/KUDU-1960
> Project: Kudu
> Issue Type: Task
> Components: test
> Reporter: Grant Henke
> Assignee: Andrew Wong
>
> "CORDS is a fault-injection system consisting of errfs, a FUSE file system,
> and errbench, a set of workloads and a behaviour inference script for each
> system under test."
> * Overview & link to source code:
> http://research.cs.wisc.edu/adsl/Software/cords/
> * Whitepaper and presentation:
> https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan
> * Blog:
> https://blog.acolyer.org/2017/03/08/redundancy-does-not-imply-fault-tolerance-analysis-of-distributed-storage-reactions-to-single-errors-and-corruptions/