J. Ryan Earl created CASSANDRA-6364:
---------------------------------------
Summary: Cassandra should exit when commit volume dies
Key: CASSANDRA-6364
URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
Project: Cassandra
Issue Type: Improvement
Environment: JBOD, single dedicated commit disk
Reporter: J. Ryan Earl
We're doing fault testing on a pre-production Cassandra cluster. One of the
tests was to simulate failure of the commit volume/disk, which in our case is
on a dedicated disk. We expected failure of the commit volume to be handled
somehow, but what we found was that Cassandra took no action when the commit
volume failed. We simulated this simply by pulling the physical disk that
backed the commit volume, which resulted in filesystem I/O errors on the mount
point.
What then happened was that the Cassandra heap filled up to the point that the
JVM was spending 90% of its time doing garbage collection. No errors were
logged regarding the failed commit volume. Gossip on other nodes in the cluster
eventually flagged the node as down. Gossip on the local node showed itself as
up, and all other nodes as down.
The most serious problem was that connections to the coordinator on this node
became very slow due to the ongoing GC, as I assume uncommitted writes piled
up on the JVM heap. What we believe should have happened is that Cassandra
should have caught the I/O error and exited with a useful log message.
Otherwise the node goes into a sort of zombie state, spending most of its time
in GC and slowing down any transactions that happen to use the coordinator on
said node.
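A minimal sketch of the requested behavior might look like the following. It
is an illustration only, not Cassandra's actual commit log code: the class
CommitVolumeGuard, its append method, and the logAndHalt handler are all
hypothetical names. The idea is that every append to the commit volume is
fsynced, and any IOException is passed to a failure handler instead of being
swallowed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

// Hypothetical sketch: NOT Cassandra's real commit log implementation.
final class CommitVolumeGuard {
    private final Path segment;                   // file on the commit volume
    private final Consumer<IOException> onFatal;  // what to do when the volume fails

    CommitVolumeGuard(Path segment, Consumer<IOException> onFatal) {
        this.segment = segment;
        this.onFatal = onFatal;
    }

    // Appends one record and forces it to disk.
    // Returns true on success; on any I/O error the handler runs and false is returned.
    boolean append(byte[] record) {
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush data and metadata so a dead disk surfaces here
            return true;
        } catch (IOException e) {
            onFatal.accept(e); // do not keep buffering writes on the heap
            return false;
        }
    }

    // One possible handler: log a useful message, then stop the process.
    // halt() skips shutdown hooks, which could themselves block on the dead
    // volume; System.exit(1) would be the gentler alternative.
    static void logAndHalt(IOException e) {
        System.err.println("FATAL: I/O error on commit volume, shutting down: " + e);
        Runtime.getRuntime().halt(1);
    }
}
```

A caller would construct the guard with CommitVolumeGuard::logAndHalt as the
handler. Taking the handler as a parameter keeps the policy (exit, stop
accepting writes, or ignore) separate from the detection of the failure.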
--
This message was sent by Atlassian JIRA
(v6.1#6144)