Nodes can get stuck in UP state forever, despite being DOWN
-----------------------------------------------------------
Key: CASSANDRA-3626
URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
Project: Cassandra
Issue Type: Bug
Components: Core
Reporter: Peter Schuller
Assignee: Peter Schuller
This is a proposed phrasing for an upstream ticket named "Newly discovered
nodes that are down get stuck in UP state forever" (will edit w/ feedback until
done):
We have observed a problem with gossip whereby, when you bootstrap a new node
(or replace one using the replace_token support), any node in the cluster
which is Down at the time the new node starts will be assumed to be Up, and
will *never* flap back to Down until you restart the new node.
This has at least two implications for replacing or bootstrapping new nodes
when there are nodes down in the ring:
* If the new node happens to select, as a stream source, a node that is listed
as UP but is in reality DOWN, streaming will hang forever.
* If it picks another host instead, it will finish bootstrapping correctly and
begin servicing requests, all the while thinking the DOWN nodes are UP, and
thus route requests to them, generating timeouts.
The way to get out of this is to restart the node(s) that you bootstrapped.
I have tested and confirmed the symptom (that the bootstrapped node thinks the
other nodes are Up) on a fairly recent 1.0. The main debugging effort happened
on 0.8, however, so all details below refer to 0.8; they are probably similar
in 1.0.
Steps to reproduce:
* Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is
operative with one node removed.
* Pick two random nodes A and B. Shut them *both* off.
* Wait for everyone to realize they are both off (for good measure).
* Now take node A, nuke its data directories, and restart it so that it comes
up with a normal bootstrap (or use replace_token; I didn't test that, but it
should not affect the outcome).
* Watch how node A starts up, all the while believing node B is up, even
though all other nodes in the cluster agree that B is down and B is in fact
still turned off.
The mechanism by which the down node initially goes into the Up state is that
the bootstrapping node receives a gossip response from any other node in the
cluster, and GossipDigestAck2VerbHandler.doVerb() calls
Gossiper.applyStateLocally(). Gossiper.applyStateLocally() has no local
endpoint state for anything in the cluster yet, so the else branch at the end
("it's a new node") is triggered and handleMajorStateChange() is called.
handleMajorStateChange() always calls markAlive(), unless the state is a dead
state ("dead" here does not mean "not up"; it refers to joining/hibernate etc).
So at this point the down node is Up in the mind of the node you just
bootstrapped.
Now, in each gossip round doStatusCheck() is called; it iterates over all
nodes (including the one falsely marked Up) and, among other things, calls
FailureDetector.interpret() on each of them.
FailureDetector.interpret() is meant to update its estimate of Phi for the
node and potentially convict it. However, there is a short-circuit at the top:
if we do not yet have an arrival window for the node, we simply return
immediately.
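Condensed, that short-circuit looks like this (again a paraphrase from the
0.8-era FailureDetector, with ArrivalWindow standing in for the real
sliding-window Phi estimator):

{code}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy paraphrase of the 0.8-era FailureDetector paths discussed here.
class FailureDetectorSketch
{
    // stand-in for the real sliding-window Phi estimator
    static class ArrivalWindow
    {
        void add(long arrivalTimeMillis) { /* record inter-arrival gap */ }
        double phi(long nowMillis) { return 0.0; /* suspicion level */ }
    }

    final Map<InetAddress, ArrivalWindow> arrivalSamples = new ConcurrentHashMap<>();
    static final double PHI_CONVICT_THRESHOLD = 8.0; // the 0.8 default

    public void interpret(InetAddress ep)
    {
        ArrivalWindow hbWnd = arrivalSamples.get(ep);
        if (hbWnd == null)
            return; // the short-circuit: no history, no Phi, no conviction
        double phi = hbWnd.phi(System.currentTimeMillis());
        if (phi > PHI_CONVICT_THRESHOLD)
            convict(ep);
    }

    void convict(InetAddress ep) { /* listeners mark ep Down */ }
}
{code}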
Arrival intervals are only added as a result of a FailureDetector.report()
call, which never happens in this case: the initial endpoint state we added
came from a remote node that was up and already carried the latest version of
the gossip state, so Gossiper.reportFailureDetector() never sees a newer
version and thus never calls report(). The result is that the node can never
be convicted.
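For completeness, in the same sketch report() is the only place an
ArrivalWindow ever gets created, which is why the short-circuit above can
persist indefinitely:

{code}
// Added to FailureDetectorSketch above: the only code path that creates an
// ArrivalWindow. No newer heartbeat version -> no report() -> no window.
public void report(InetAddress ep)
{
    long now = System.currentTimeMillis();
    ArrivalWindow hbWnd = arrivalSamples.get(ep);
    if (hbWnd == null)
    {
        hbWnd = new ArrivalWindow();
        arrivalSamples.put(ep, hbWnd);
    }
    hbWnd.add(now);
}
{code}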
Now, let's set aside for a moment the problem that a node which is actually
Down will temporarily be thought to be Up for a little while. That is
sub-optimal, but let's aim for a fix to the more serious problem in this
ticket - which is that it stays Up forever.
Considered solutions:
* When interpret() gets called and there is no arrival window, we could add a
faked arrival window far back in time, giving the node history and causing it
to be marked Down. This "works" in the particular test case. The problem is
that since we are not ourselves actively trying to gossip to these nodes with
any particular urgency, it might take significant time before we get any kind
of confirmation from someone else that the node is actually Up in cases where
it *is* Up, so it's not clear that this is a good idea.
* When interpret() gets called and there is no arrival window, we can simply
convict the node immediately. This behaves roughly the same as the previous
suggestion.
* When interpret() gets called and there is no arrival window, we can add a
faked arrival window at the current time, which allows the node to be treated
as Up until the usual amount of time has passed for Phi to exceed the
conviction threshold (see the sketch after this list).
* When interpret() gets called and there is no arrival window, we can
immediately convict the node, *and* schedule it for immediate gossip on the
next round, to try to ensure nodes go Up quickly if they are indeed up. This
incurs O(n) gossip traffic as a special case during node start-up. While
theoretically a problem, I personally think we can ignore that for now, since
it won't be a significant problem any time soon. However, this option is more
complicated, since the way we queue up messages is asynchronous with respect
to background connection attempts; we'd have to make sure the initial gossip
message actually gets sent on an open TCP connection (I haven't confirmed
whether this will be the case or not).
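To make the third option concrete, here is a hypothetical patch against the
interpret() sketch above (the real change would go into
FailureDetector.interpret(); this is only to illustrate the shape of the fix,
not a definitive implementation):

{code}
// Hypothetical sketch of option three: on first sight of an endpoint with
// no history, seed a window "now" so the normal Phi machinery eventually
// convicts it if no heartbeats ever arrive.
public void interpret(InetAddress ep)
{
    ArrivalWindow hbWnd = arrivalSamples.get(ep);
    if (hbWnd == null)
    {
        // fake an arrival at the current time: the node stays Up for the
        // usual grace period, then Phi climbs past the threshold if the
        // node never actually reports
        hbWnd = new ArrivalWindow();
        hbWnd.add(System.currentTimeMillis());
        arrivalSamples.put(ep, hbWnd);
        return;
    }
    double phi = hbWnd.phi(System.currentTimeMillis());
    if (phi > PHI_CONVICT_THRESHOLD)
        convict(ep);
}
{code}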
The first three are simple to implement, and possibly the fourth. But in all
cases I am worried about potential negative consequences that I am not seeing.
Thoughts?