[
https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889330#comment-13889330
]
Marcus Eriksson commented on CASSANDRA-6364:
--------------------------------------------
About the ignore case, let's hard-code something for now: rate-limit to one
logged error message per second, perhaps?
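
A rough sketch of what such a hard-coded limit could look like. The class and method names here (RateLimitedErrorLog, shouldLog, error) are hypothetical and not taken from the patch, and the sketch writes to System.err rather than to Cassandra's logger:

```java
// Hypothetical helper, not part of Cassandra: lets at most one error
// message through per configured interval and drops the rest.
final class RateLimitedErrorLog {
    private final long minIntervalNanos;
    private boolean loggedBefore = false;
    private long lastLoggedNanos;

    RateLimitedErrorLog(long minIntervalNanos) {
        this.minIntervalNanos = minIntervalNanos;
    }

    // Returns true if a message may be logged at time nowNanos.
    // The timestamp is passed in so the decision is easy to test.
    synchronized boolean shouldLog(long nowNanos) {
        if (loggedBefore && nowNanos - lastLoggedNanos < minIntervalNanos) {
            return false; // still inside the quiet interval
        }
        loggedBefore = true;
        lastLoggedNanos = nowNanos;
        return true;
    }

    // Convenience wrapper using the monotonic JVM clock.
    void error(String message, Throwable cause) {
        if (shouldLog(System.nanoTime())) {
            System.err.println("ERROR " + message + ": " + cause);
        }
    }
}
```

With new RateLimitedErrorLog(1_000_000_000L), repeated commit log failures under 'ignore' would produce at most one line per second instead of flooding the log.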
I don't think we should default to 'ignore' in Config.java. If someone does a
minor upgrade, they most likely won't check NEWS or update their config files
to add the new parameter.
The shipped config in cassandra.yaml looks wrong: it should be
commit_failure_policy, not disk_failure_policy, I guess.
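
To illustrate the two points above, here is a minimal sketch of a config holder whose default applies when the yaml omits the key. Everything in it is an assumption for illustration: the class FailurePolicyConfig, the helper fromYamlValues, and the policy names 'stop' and 'stop_commit' are hypothetical; only 'ignore' and the key name commit_failure_policy come from this discussion.

```java
import java.util.Map;

// Hypothetical sketch, not the actual Config.java.
final class FailurePolicyConfig {
    // Policy names other than 'ignore' are assumed for illustration.
    enum CommitFailurePolicy { stop, stop_commit, ignore }

    // Used when the yaml has no commit_failure_policy entry, e.g. after a
    // minor upgrade with an old config file. Deliberately not 'ignore'.
    CommitFailurePolicy commit_failure_policy = CommitFailurePolicy.stop;

    // Builds a config from already-parsed yaml key/value pairs.
    static FailurePolicyConfig fromYamlValues(Map<String, String> values) {
        FailurePolicyConfig config = new FailurePolicyConfig();
        // The key is commit_failure_policy, separate from disk_failure_policy.
        String raw = values.get("commit_failure_policy");
        if (raw != null) {
            config.commit_failure_policy = CommitFailurePolicy.valueOf(raw);
        }
        return config;
    }
}
```

An operator who never adds the new key gets the non-ignoring default, while setting commit_failure_policy: ignore remains an explicit opt-in.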
> There should be different disk_failure_policies for data and commit volumes
> or commit volume failure should always cause node exit
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-6364
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: JBOD, single dedicated commit disk
> Reporter: J. Ryan Earl
> Assignee: Benedict
> Fix For: 2.0.5
>
>
> We're doing fault testing on a pre-production Cassandra cluster. One of the
> tests was to simulate failure of the commit volume/disk, which in our case
> is on a dedicated disk. We expected failure of the commit volume to be
> handled somehow, but what we found was that no action was taken by Cassandra
> when the commit volume failed. We simulated this simply by pulling the
> physical disk that backed the commit volume, which resulted in filesystem I/O
> errors on the mount point.
> What then happened was that the Cassandra heap filled up to the point that it
> was spending 90% of its time doing garbage collection. No errors were logged
> regarding the failed commit volume. Gossip on other nodes in the cluster
> eventually flagged the node as down. Gossip on the local node showed itself
> as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node
> became very slow due to the on-going GC, as I assume uncommitted writes piled
> up on the JVM heap. What we believe should have happened is that Cassandra
> should have caught the I/O error and exited with a useful log message, or
> otherwise done some sort of useful cleanup. Otherwise the node goes into a
> sort of zombie state, spending most of its time in GC, and thus slowing down
> any transactions that happen to use the coordinator on said node.
> A limit on in-memory, unflushed writes before refusing requests may also
> work. The point is that something should be done to handle the commit volume
> dying, since doing nothing ends up affecting the entire cluster. I should
> note, we are using: disk_failure_policy: best_effort