[ https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824719#comment-13824719 ]
Mikhail Stepura commented on CASSANDRA-6364:
--------------------------------------------
bq. What we believe should have happened is that Cassandra should have caught the I/O error and exited with a useful log message

I believe you should use {{disk_failure_policy: stop}} for that.
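
For reference, a minimal sketch of how that setting would look in {{cassandra.yaml}}. Only the {{disk_failure_policy}} key and its {{stop}} and {{best_effort}} values come from this thread; the comments summarize the policy descriptions shipped in the default configuration file. Whether this policy is also applied to failures of the commit log volume is not confirmed here, and note that {{stop}} shuts down gossip and Thrift rather than exiting the process.

{code}
# cassandra.yaml (fragment)
# Policy for data disk failures:
#   stop        - shut down gossip and Thrift, leaving the node effectively
#                 dead, but still inspectable via JMX
#   best_effort - stop using the failed disk and answer requests from the
#                 remaining available sstables
#   ignore      - ignore fatal errors and let requests fail
disk_failure_policy: stop
{code}
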
> Cassandra should exit or otherwise handle when the commit volume dies
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-6364
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
> Project: Cassandra
> Issue Type: Improvement
> Environment: JBOD, single dedicated commit disk
> Reporter: J. Ryan Earl
>
> We're doing fault testing on a pre-production Cassandra cluster. One of the
> tests was to simulate failure of the commit volume/disk, which in our case
> is on a dedicated disk. We expected failure of the commit volume to be
> handled somehow, but what we found was that no action was taken by Cassandra
> when the commit volume failed. We simulated this simply by pulling the
> physical disk that backed the commit volume, which resulted in filesystem I/O
> errors on the mount point.
> What then happened was that the Cassandra heap filled up to the point that it
> was spending 90% of its time doing garbage collection. No errors were logged
> regarding the failed commit volume. Gossip on other nodes in the cluster
> eventually flagged the node as down. Gossip on the local node showed itself
> as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node
> became very slow due to the ongoing GC, as I assume uncommitted writes piled
> up on the JVM heap. What we believe should have happened is that Cassandra
> should have caught the I/O error and exited with a useful log message, or
> otherwise done some sort of useful cleanup. Otherwise the node goes into a
> sort of zombie state, spending most of its time in GC, and thus slowing down
> any transactions that happen to use the coordinator on said node.
> A limit on in-memory, unflushed writes before refusing requests may also
> work. The point is that something should be done to handle the commit volume
> dying, because doing nothing ends up affecting the entire cluster. I should
> note that we are using: disk_failure_policy: best_effort
--
This message was sent by Atlassian JIRA
(v6.1#6144)