[ https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J. Ryan Earl updated CASSANDRA-6364:
------------------------------------

    Description: 
We're doing fault testing on a pre-production Cassandra cluster.  One of the 
tests was to simulate failure of the commit volume/disk, which in our case is 
on a dedicated disk.  We expected failure of the commit volume to be handled 
somehow, but what we found was that no action was taken by Cassandra when the 
commit volume failed.  We simulated this simply by pulling the physical disk 
that backed the commit volume, which resulted in filesystem I/O errors on the 
mount point.

What then happened was that the Cassandra heap filled up to the point that the 
JVM was spending 90% of its time doing garbage collection.  No errors were 
logged regarding the failed commit volume.  Gossip on other nodes in the cluster 
eventually flagged the node as down.  Gossip on the local node showed itself as 
up, and all other nodes as down.

The most serious problem was that connections to the coordinator on this node 
became very slow due to the ongoing GC, as I assume uncommitted writes piled 
up on the JVM heap.  We believe Cassandra should have caught the I/O error and 
exited with a useful log message.  Otherwise the node goes into a sort of 
zombie state--spending most of its time in GC--slowing down any transactions 
that happen to use the coordinator on said node.

  was:
We're doing fault testing on a pre-production Cassandra cluster.  One of the 
tests was to simulation failure of the commit volume/disk, which in our case is 
on a dedicated disk.  We expected failure of the commit volume to be handled 
somehow, but what we found was that no action was taken by Cassandra when the 
commit volume fail.  We simulated this simply by pulling the physical disk that 
backed the commit volume, when resulted in filesystem I/O errors on the mount 
point.

What then happened was that the Cassandra Heap filled up to the point that it 
was spending 90% of its time doing garbage collection.  No errors were logged 
in regards to the failed commit volume.  Gossip on other nodes in the cluster 
eventually flagged the node as down.  Gossip on the local node showed itself as 
up, and all other nodes as down.

The most serious problem was that connections to the coordinator on this node 
became very slow due to the on-going GC, as I assume uncommitted writes piled 
up on the JVM heap.  What we believe should have happen is that Cassandra 
should have caught the I/O error and exited with a useful log message.  
Otherwise the node goes into a sort of Zombie state--spending most of its time 
in GC--slowing down any transactions that happen to use the coordinator on said 
node.


> Cassandra should exit when commit volume dies
> ---------------------------------------------
>
>                 Key: CASSANDRA-6364
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
>             Project: Cassandra
>          Issue Type: Improvement
>         Environment: JBOD, single dedicated commit disk
>            Reporter: J. Ryan Earl
>
> We're doing fault testing on a pre-production Cassandra cluster.  One of the 
> tests was to simulate failure of the commit volume/disk, which in our case 
> is on a dedicated disk.  We expected failure of the commit volume to be 
> handled somehow, but what we found was that no action was taken by Cassandra 
> when the commit volume failed.  We simulated this simply by pulling the 
> physical disk that backed the commit volume, which resulted in filesystem I/O 
> errors on the mount point.
> What then happened was that the Cassandra heap filled up to the point that the 
> JVM was spending 90% of its time doing garbage collection.  No errors were 
> logged regarding the failed commit volume.  Gossip on other nodes in the cluster 
> eventually flagged the node as down.  Gossip on the local node showed itself 
> as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node 
> became very slow due to the ongoing GC, as I assume uncommitted writes piled 
> up on the JVM heap.  We believe Cassandra should have caught the I/O error 
> and exited with a useful log message.  Otherwise the node goes into a sort of 
> zombie state--spending most of its time in GC--slowing down any transactions 
> that happen to use the coordinator on said node.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
