[jira] [Commented] (CASSANDRA-6364) There should be different disk_failure_policies for data and commit volumes or commit volume failure should always cause node exit

Marcus Eriksson (JIRA) Mon, 03 Feb 2014 01:04:46 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889330#comment-13889330
 ]


Marcus Eriksson commented on CASSANDRA-6364:
--------------------------------------------

About the ignore case, let's hard-code something for now - rate limit to one 
error log message per second, perhaps?
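
A minimal sketch of such a hard-coded rate limit, assuming a one-second 
interval. The class RateLimitedErrorLog, its tryLog method and the injected 
clock are hypothetical names for illustration and are not taken from the 
patch; a real version would call the node's logger rather than System.err.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Cassandra: lets at most one error
// message through per interval and silently drops the rest.
final class RateLimitedErrorLog {
    private final long minIntervalNanos;
    private final LongSupplier clock;      // e.g. System::nanoTime
    private final AtomicLong lastLogged;   // clock reading of the last emitted message

    RateLimitedErrorLog(long minIntervalNanos, LongSupplier clock) {
        this.minIntervalNanos = minIntervalNanos;
        this.clock = clock;
        // Back-date the last log time so the very first message is emitted.
        this.lastLogged = new AtomicLong(clock.getAsLong() - minIntervalNanos);
    }

    /** Returns true if the message was emitted, false if it was suppressed. */
    boolean tryLog(String message) {
        long now = clock.getAsLong();
        long last = lastLogged.get();
        if (now - last < minIntervalNanos) {
            return false;                  // still inside the quiet period
        }
        if (!lastLogged.compareAndSet(last, now)) {
            return false;                  // another thread logged first
        }
        System.err.println(message);       // stand-in for the real logger
        return true;
    }
}
```

With System::nanoTime as the clock and an interval of 1_000_000_000L, a 
failing commit log writer can call tryLog on every error without flooding the 
log.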

I don't think we should default to 'ignore' in Config.java - if someone does a 
minor upgrade they most likely won't check NEWS or update their config files to 
add the new parameter.
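
To illustrate why the in-code default matters: when the key is absent from an 
older config file, whatever the code defaults to is what the node runs with. 
The sketch below is an assumption-laden illustration, not the actual 
Config.java; the 'stop' value, the CommitPolicyDefaults class and the resolve 
method are hypothetical.

```java
import java.util.Map;

// Hypothetical illustration, not Cassandra's Config.java.
final class CommitPolicyDefaults {
    // 'ignore' is named in this thread; 'stop' is an assumed alternative.
    enum CommitFailurePolicy { stop, ignore }

    // Used whenever the key is missing, e.g. after a minor upgrade
    // where the operator kept the old config file.
    static final CommitFailurePolicy DEFAULT = CommitFailurePolicy.stop;

    static CommitFailurePolicy resolve(Map<String, String> yaml) {
        String raw = yaml.get("commit_failure_policy");
        return raw == null ? DEFAULT : CommitFailurePolicy.valueOf(raw);
    }
}
```

An operator who wants the 'ignore' behaviour then has to opt in explicitly, 
while untouched configs get the safer setting.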

The shipped config in cassandra.yaml looks wrong: the key should be 
commit_failure_policy, not disk_failure_policy, I guess.

> There should be different disk_failure_policies for data and commit volumes 
> or commit volume failure should always cause node exit
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6364
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: JBOD, single dedicated commit disk
>            Reporter: J. Ryan Earl
>            Assignee: Benedict
>             Fix For: 2.0.5
>
>
> We're doing fault testing on a pre-production Cassandra cluster.  One of the 
> tests was to simulate failure of the commit volume/disk, which in our case 
> is on a dedicated disk.  We expected failure of the commit volume to be 
> handled somehow, but what we found was that no action was taken by Cassandra 
> when the commit volume failed.  We simulated this simply by pulling the 
> physical disk that backed the commit volume, which resulted in filesystem I/O 
> errors on the mount point.
> What then happened was that the Cassandra Heap filled up to the point that it 
> was spending 90% of its time doing garbage collection.  No errors were logged 
> regarding the failed commit volume.  Gossip on other nodes in the cluster 
> eventually flagged the node as down.  Gossip on the local node showed itself 
> as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node 
> became very slow due to the on-going GC, as I assume uncommitted writes piled 
> up on the JVM heap.  What we believe should have happened is that Cassandra 
> should have caught the I/O error and exited with a useful log message, or 
> otherwise done some sort of useful cleanup.  Otherwise the node goes into a 
> sort of Zombie state, spending most of its time in GC, and thus slowing down 
> any transactions that happen to use the coordinator on said node.
> A limit on in-memory, unflushed writes before refusing requests may also 
> work.  Point being, something should be done to handle the commit volume 
> dying as doing nothing results in affecting the entire cluster.  I should 
> note, we are using: disk_failure_policy: best_effort



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
