Here's a weird one... what's the best way to get a Cassandra node into a
"half-crashed" state?

We have a 3-node cluster running 0.7.5. A few days ago this happened
organically to node1: the partition the commitlog was on was 100% full and
there was a "No space left on device" error. After a while, although the
cluster and node1 were still up, the other nodes considered node1 down, and
messages like:
    DEBUG 14:36:55,546 ... timed out
started to show up in its debug logs.

We have a tool to indicate to the load balancer that a Cassandra node is
down, but it didn't detect it that time. Now I'm having trouble
purposefully getting the node back to that state, so that I can try other
monitoring methods. I've tried to fill up the commitlog partition with other
files, and although I get the "No space left on device" error, the node
still doesn't go down and show the other symptoms it showed before.

Also, if anyone could recommend a good way for a node itself to detect that
it's in such a state, I'd be interested in that too. Currently we make a
"describe_cluster_name()" Thrift call, but that still succeeded while the
node was "down". I'm thinking of something like
reading/writing to a fixed value in a keyspace as a check... Unfortunately
Java-based solutions are out of the question.
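
The read/write check might look something like the rough Python sketch
below. It is only an outline under assumptions: `write` and `read` are
hypothetical placeholders for whatever non-Java client calls would insert
and fetch the fixed key on the local node, and the 2-second timeout is an
arbitrary choice.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def round_trip_ok(write, read, timeout=2.0):
    """Return True only if a fresh value can be written and read back in time.

    `write(value)` and `read()` are hypothetical callables supplied by the
    caller; they should talk to the local node only.
    """
    token = uuid.uuid4().hex  # unique per probe, so a stale read cannot pass

    def probe():
        write(token)
        return read() == token

    executor = ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(probe).result(timeout=timeout)
    except FutureTimeout:
        return False  # the call hung past the deadline
    except Exception:
        return False  # any client error counts as unhealthy
    finally:
        # Do not block on a hung probe thread; it is left to finish on its own.
        executor.shutdown(wait=False)
```

Writing a new random value on each probe, rather than a constant, means a
node that can still serve old reads but can no longer accept writes would
fail the check. Whether that would have caught the state described above
is untested.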


Thanks,
Suan
