An important read

Andrew Purtell Mon, 06 Oct 2014 20:56:56 -0700

https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf


Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-intensive Systems
Yuan et al., University of Toronto

Large, production quality distributed systems still fail periodically, and
do so sometimes catastrophically, where most or all users experience an
outage or data loss. We present the result of a comprehensive study
investigating 198 randomly selected, user-reported failures that occurred
on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop
MapReduce, and Redis, with the goal of understanding how one or multiple
faults eventually evolve into a user-visible failure. We found that from a
testing point of view, almost all failures require only 3 or fewer nodes to
reproduce, which is good news considering that these services typically run
on a very large number of nodes. However, multiple inputs are needed to
trigger the failures with the order between them being important. Finally,
we found the error logs of these systems typically contain sufficient data
on both the errors and the input events that triggered the failure,
enabling the diagnosis and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been
prevented by performing simple testing on error handling code – the last
line of defense – even without an understanding of the software design. We
extracted three simple rules from the bugs that have led to some of the
catastrophic failures, and developed a static checker, Aspirator, capable
of locating these bugs. Over 30% of the catastrophic failures would have
been prevented had Aspirator been used and the identified bugs fixed.
Running Aspirator on the code of 9 distributed systems located 143 bugs and
bad practices that have been fixed or confirmed by the developers.


This is an interesting benefit of open source and an open development
process. Please read this detailed analysis of availability and data loss
bugs resulting from improper error handling, in HBase and other systems.
The authors focus on a particular pattern of defect and cause. The point is
well taken. It would be worth taking time where possible to revisit
exception handling, especially where we have low test coverage.

Also, consider HBASE-11912. The static analyses mentioned in this paper
could likely be implemented with error-prone. Development and code review
will always be uneven in a volunteer open source project. However, if we
agree on some baseline practices, and those are amenable to static
analysis, then we could build validation of those practices into the
compiler, in effect.
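
To make the idea of a mechanically checkable baseline practice concrete, here
is a rough sketch of a scan for two suspicious handler shapes: an empty catch
block, and a catch block that still carries a TODO or FIXME. Everything in it
is an assumption made for illustration: the two patterns were picked as
examples (see the paper for the rules Aspirator actually applies), the Java
snippet is invented, and the regex is a crude text heuristic that ignores
nested braces. It is not how Aspirator or error-prone work; a real check
would operate on the compiler's syntax tree.

```python
import re

# Hypothetical Java source with handlers worth a second look.
# Invented for this sketch; not taken from HBase or any real project.
JAVA_SAMPLE = """
try {
    flush();
} catch (IOException e) {
}
try {
    split();
} catch (Exception e) {
    // TODO: handle this
}
try {
    close();
} catch (IOException e) {
    LOG.error("close failed", e);
    throw e;
}
"""

# Matches a catch clause whose body contains no nested braces.
CATCH_RE = re.compile(r"catch\s*\(([^)]*)\)\s*\{([^{}]*)\}")


def find_suspect_handlers(source):
    """Return (line_number, reason) pairs for suspicious catch blocks."""
    findings = []
    for match in CATCH_RE.finditer(source):
        body = match.group(2)
        # 1-based line number of the 'catch' keyword.
        line = source.count("\n", 0, match.start()) + 1
        if not body.strip():
            findings.append((line, "empty handler"))
        elif re.search(r"\b(TODO|FIXME)\b", body):
            findings.append((line, "handler contains TODO/FIXME"))
    return findings


if __name__ == "__main__":
    for line, reason in find_suspect_handlers(JAVA_SAMPLE):
        print(f"line {line}: {reason}")
```

The third handler logs and rethrows, so the sketch does not flag it. The
point is only that practices of this kind are simple enough to state as a
rule a tool can enforce on every build.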

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
