[
https://issues.apache.org/jira/browse/CASSANDRA-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889343#comment-13889343
]
Benedict commented on CASSANDRA-6364:
-------------------------------------
bq. I don't think we should default to 'ignore' in Config.java
Well, I wasn't too sure about this. On the one hand, switching the default to
"stop" means we could overcautiously kill users' hosts unexpectedly, possibly
interrupting service (especially, say, for users running on a SAN, strongly
discouraged as that is). On the other hand, switching to "ignore" means we may
not be durable. Neither is a great default, but both are better than before.
I'm comfortable with either, so if you feel strongly it should be "stop", I'll
happily switch it. Perhaps I lean slightly in favour of it too, but it really
depends on whether the user favours durability over availability, so there
doesn't seem to be a single correct answer. Note that the default
disk_failure_policy is also ignore, and the prior behaviour was closest to
ignore, so introducing a default that results in a failing node is somewhat
unprecedented for disk failure.
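
To make the trade-off concrete, here is a minimal sketch of how the two
policies might be dispatched. The class, enum and method names are
hypothetical and chosen for illustration only; they are not taken from the
patch or from Cassandra's code.

```java
import java.io.IOException;

// Hypothetical sketch: names are illustrative, not Cassandra's actual classes.
final class CommitFailurePolicySketch {
    enum Policy { STOP, IGNORE }

    enum Outcome { NODE_STOPPED, ERROR_LOGGED }

    // Decide what to do when a commit log operation fails with an I/O error.
    static Outcome handle(Policy policy, IOException error) {
        switch (policy) {
            case STOP:
                // Favour durability: give up rather than accept writes
                // that can no longer be made durable.
                System.err.println("Stopping node after commit log error: " + error);
                return Outcome.NODE_STOPPED;
            case IGNORE:
                // Favour availability: report the error and carry on.
                System.err.println("Ignoring commit log error: " + error);
                return Outcome.ERROR_LOGGED;
            default:
                throw new AssertionError("unknown policy " + policy);
        }
    }
}
```

A real implementation would shut down the relevant services in the STOP
branch; the sketch only returns the chosen outcome so the decision itself is
easy to see.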
bq. The shipped config in cassandra.yaml looks wrong, should be
commit_failure_policy, not disk_failure_policy I guess
Right, looks like I didn't update the first or last lines I copy-pasted.
Thanks.
bq. About the ignore case, let's hard code something for now - rate limit at one
log error message per second perhaps?
If we're just rate limiting the log messages, I'd say one per minute might be
better. But I'm not sure having the threads spin trying to make progress is
useful. The PCLES, for instance, will just start burning one core until it can
successfully sync, assuming it doesn't actually have to wait each time to
encounter the error. I'm tempted to add a 1s pause after an error, during which
we just sleep the erroring thread.
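
A rough sketch of that idea, assuming one helper instance per erroring thread:
log at most once per interval, and sleep the thread after every error so it
does not spin. All names here are hypothetical, and the one-minute and
one-second values are just the figures floated above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, intended to be owned by a single thread (not thread-safe).
final class ErrorBackoffSketch {
    private final long logIntervalNanos;
    private final long pauseMillis;
    private long lastLogNanos;
    private boolean loggedOnce;

    ErrorBackoffSketch(long logIntervalNanos, long pauseMillis) {
        this.logIntervalNanos = logIntervalNanos;
        this.pauseMillis = pauseMillis;
    }

    // One log message per minute, one second of sleep per error.
    static ErrorBackoffSketch withSuggestedValues() {
        return new ErrorBackoffSketch(TimeUnit.MINUTES.toNanos(1),
                                      TimeUnit.SECONDS.toMillis(1));
    }

    // Returns true if enough time has passed since the last logged error.
    boolean shouldLog(long nowNanos) {
        if (!loggedOnce || nowNanos - lastLogNanos >= logIntervalNanos) {
            loggedOnce = true;
            lastLogNanos = nowNanos;
            return true;
        }
        return false;
    }

    // Call from the erroring thread: maybe log, then pause before retrying.
    void onError(Throwable error, long nowNanos) throws InterruptedException {
        if (shouldLog(nowNanos)) {
            System.err.println("Commit log error (rate limited): " + error);
        }
        Thread.sleep(pauseMillis);
    }
}
```

The caller would pass System.nanoTime() as nowNanos; taking the clock as a
parameter keeps the rate-limiting logic easy to check in isolation.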
Another issue that slightly concerns me is what happens if the CLES sync()
starts failing, but the append and the CLA don't. With "ignore" this could
potentially result in us mapping in and allocating huge amounts of disk space,
but not being able to sync or clear it. This might result in lots of swapping
and/or us exceeding our max log space goal by a large margin. Since
we never guarantee to keep to this I'm not sure how much of a problem it would
be, but an error down to ACLs that stops us syncing one file might potentially
end up eating up huge quantities of commit disk space. I'm tempted to have the
CLA thread block once it hits twice its goal max space (or maybe introduce a
second config parameter for a hard maximum). But I'm also tempted to leave
these changes for the 2.1 branch, since it's a fairly specific failure case,
and what we have is a big improvement over the current state of affairs.
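
For illustration, a hard cap at twice the goal could look roughly like the
sketch below. The class and method names are hypothetical; this is not the
CLA's real interface, just the blocking rule in isolation.

```java
// Hypothetical sketch of a hard cap on commit log space at twice the goal.
final class CommitSpaceGateSketch {
    private final long hardLimitBytes;
    private long allocatedBytes;

    CommitSpaceGateSketch(long goalMaxBytes) {
        // The soft goal may be exceeded, but never by more than a factor of two.
        this.hardLimitBytes = 2 * goalMaxBytes;
    }

    // Non-blocking: reserve space only if it keeps us within the hard limit.
    synchronized boolean tryAllocate(long segmentBytes) {
        if (allocatedBytes + segmentBytes > hardLimitBytes) {
            return false;
        }
        allocatedBytes += segmentBytes;
        return true;
    }

    // Blocking: the allocating thread waits until enough space is released.
    synchronized void allocate(long segmentBytes) throws InterruptedException {
        while (!tryAllocate(segmentBytes)) {
            wait();
        }
    }

    // Called once a segment has been synced and recycled or deleted.
    synchronized void release(long segmentBytes) {
        allocatedBytes -= segmentBytes;
        notifyAll();
    }

    synchronized long allocated() {
        return allocatedBytes;
    }
}
```

With this shape, a sync() that keeps failing means release() is never called,
so the allocating thread stalls at the hard limit instead of consuming ever
more commit disk space.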
> There should be different disk_failure_policies for data and commit volumes
> or commit volume failure should always cause node exit
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-6364
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6364
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: JBOD, single dedicated commit disk
> Reporter: J. Ryan Earl
> Assignee: Benedict
> Fix For: 2.0.5
>
>
> We're doing fault testing on a pre-production Cassandra cluster. One of the
> tests was to simulate failure of the commit volume/disk, which in our case
> is on a dedicated disk. We expected failure of the commit volume to be
> handled somehow, but what we found was that no action was taken by Cassandra
> when the commit volume failed. We simulated this simply by pulling the
> physical disk that backed the commit volume, which resulted in filesystem I/O
> errors on the mount point.
> What then happened was that the Cassandra heap filled up to the point that the
> JVM was spending 90% of its time doing garbage collection. No errors were
> logged regarding the failed commit volume. Gossip on other nodes in the cluster
> eventually flagged the node as down. Gossip on the local node showed itself
> as up, and all other nodes as down.
> The most serious problem was that connections to the coordinator on this node
> became very slow due to the on-going GC, as I assume uncommitted writes piled
> up on the JVM heap. What we believe should have happened is that Cassandra
> should have caught the I/O error and exited with a useful log message, or
> otherwise done some sort of useful cleanup. Otherwise the node goes into a
> sort of Zombie state, spending most of its time in GC, and thus slowing down
> any transactions that happen to use the coordinator on said node.
> A limit on in-memory, unflushed writes before refusing requests may also
> work. The point is that something should be done to handle the commit volume
> dying, as doing nothing ends up affecting the entire cluster. I should
> note, we are using: disk_failure_policy: best_effort
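
The limit on in-memory, unflushed writes suggested in the description could be
sketched as a simple admission counter. This is only an illustration under
assumed names; nothing here corresponds to an existing Cassandra class or
setting.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: refuse new writes once too many bytes are unflushed.
final class UnflushedWriteLimiterSketch {
    private final long maxUnflushedBytes;
    private final AtomicLong unflushedBytes = new AtomicLong();

    UnflushedWriteLimiterSketch(long maxUnflushedBytes) {
        this.maxUnflushedBytes = maxUnflushedBytes;
    }

    // Returns false (request refused) if admitting the write would exceed the limit.
    boolean tryAdmit(long writeBytes) {
        while (true) {
            long current = unflushedBytes.get();
            long next = current + writeBytes;
            if (next > maxUnflushedBytes) {
                return false;
            }
            if (unflushedBytes.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    // Called once a write has been made durable, freeing its budget.
    void onFlushed(long writeBytes) {
        unflushedBytes.addAndGet(-writeBytes);
    }
}
```

If the commit volume dies and nothing is ever flushed, the counter stops
shrinking and further requests are refused rather than piling up on the heap.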
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)