[ https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J. Ryan Earl updated CASSANDRA-6364:
------------------------------------

    Summary: Cassandra should exit or otherwise handle when the commit volume dies  (was: Cassandra should exit when commit volume dies)

> Cassandra should exit or otherwise handle when the commit volume dies
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-6364
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
>             Project: Cassandra
>          Issue Type: Improvement
>         Environment: JBOD, single dedicated commit disk
>            Reporter: J. Ryan Earl
>
> We're doing fault testing on a pre-production Cassandra cluster. One of the
> tests was to simulate failure of the commit volume/disk, which in our case
> is on a dedicated disk. We expected failure of the commit volume to be
> handled somehow, but we found that Cassandra took no action when the commit
> volume failed. We simulated this simply by pulling the physical disk that
> backed the commit volume, which resulted in filesystem I/O errors on the
> mount point.
> What then happened was that the Cassandra heap filled up to the point that
> the JVM was spending 90% of its time doing garbage collection. No errors were
> logged regarding the failed commit volume. Gossip on other nodes in the
> cluster eventually flagged the node as down. Gossip on the local node showed
> itself as up and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node
> became very slow due to the ongoing GC, presumably because uncommitted writes
> piled up on the JVM heap. We believe Cassandra should have caught the I/O
> error and exited with a useful log message. Otherwise the node goes into a
> sort of zombie state, spending most of its time in GC and thus slowing down
> any transactions that happen to use the coordinator on that node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)