[
https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Williams updated CASSANDRA-3626:
----------------------------------------
Attachment: 3626.txt
bq. Now, let's ignore for a moment the problem that a node that is actually
Down will be thought to be Up temporarily for a little while.
This is perfectly fine since this is what the SS.RING_DELAY stabilization
period is designed to suss out - the correct ring state.
bq. Steps to reproduce:
This can be simplified to "start two nodes, shut one down, add a third."
bq. Considered solutions:
ISTM the simplest and most correct thing to do in this case is to report() new
nodes. Patch to do so.
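For illustration, here is roughly where such a report() could land (a sketch
using the method names discussed in this ticket; the attached 3626.txt is the
authoritative diff):
{code:java}
// Sketch: seed the failure detector as soon as a brand-new endpoint is
// learned about, so that interpret() has an arrival window and can convict
// the node later if it never actually gossips to us.
private void handleMajorStateChange(InetAddress ep, EndpointState epState)
{
    endpointStateMap.put(ep, epState);
    // Hypothetical placement of the fix: report() creates the ArrivalWindow
    // that FailureDetector.interpret() needs in order to compute phi.
    FailureDetector.instance.report(ep);
    if (!isDeadState(epState))
        markAlive(ep, epState);
}
{code}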
> Nodes can get stuck in UP state forever, despite being DOWN
> -----------------------------------------------------------
>
> Key: CASSANDRA-3626
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.8.8, 1.0.5
> Reporter: Peter Schuller
> Assignee: Peter Schuller
> Attachments: 3626.txt
>
>
> This is a proposed phrasing for an upstream ticket named "Newly discovered
> nodes that are down get stuck in UP state forever" (will edit w/ feedback
> until done):
> We have observed a problem with gossip whereby, when you are bootstrapping a
> new node (or replacing one using the replace_token support), any node in the
> cluster which is Down at the time the new node is started will be assumed to
> be Up and then *never ever* flapped back to Down until you restart the node.
> This has at least two implications for replacing or bootstrapping new nodes
> when there are nodes down in the ring:
> * If the new node happens to select a node that is listed as UP but is in
> reality DOWN as a stream source, streaming will sit there hanging forever.
> * If that doesn't happen (because it picks another host), it will instead
> finish bootstrapping correctly and begin servicing requests, all the while
> thinking DOWN nodes are UP and thus routing requests to them, generating
> timeouts.
> The way to get out of this is to restart the node(s) that you bootstrapped.
> I have tested and confirmed the symptom (that the bootstrapped node thinks
> other nodes are Up) using a fairly recent 1.0. The main debugging effort
> happened on 0.8, however, so all details below refer to 0.8 but are probably
> similar in 1.0.
> Steps to reproduce:
> * Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster
> is operative with one node removed.
> * Pick two random nodes, A and B. Shut them *both* off.
> * Wait for everyone to realize they are both off (for good measure).
> * Now, take node A, nuke its data directories, and restart it, such that it
> comes up with a normal bootstrap (or use replace_token; I didn't test that,
> but it should not affect the result).
> * Watch how node A starts up, all the while believing node B is up, even
> though all other nodes in the cluster agree that B is down and B is in fact
> still turned off.
> The mechanism by which it initially goes into Up state is that the node
> receives a gossip response from any other node in the cluster, and
> GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally().
> Gossiper.applyStateLocally() doesn't have any local endpoint state for that
> node, so the else statement at the end ("it's a new node") gets triggered
> and handleMajorStateChange() is called. handleMajorStateChange() always calls
> markAlive(), unless the state is a dead state ("dead" here does not mean
> "not Up"; it refers to joining/hibernate etc.).
> So at this point the node is up in the mind of the node you just bootstrapped.
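> To make this path concrete, here is a simplified sketch of the relevant
> branch (a paraphrase of the 0.8-era code, using the method names cited
> above):
> {code:java}
> void applyStateLocally(Map<InetAddress, EndpointState> epStateMap)
> {
>     for (Map.Entry<InetAddress, EndpointState> entry : epStateMap.entrySet())
>     {
>         InetAddress ep = entry.getKey();
>         EndpointState remoteState = entry.getValue();
>         EndpointState localState = endpointStateMap.get(ep);
>         if (localState != null)
>         {
>             // existing node: compare generations/versions, and notify the
>             // failure detector only if something newer has arrived
>         }
>         else
>         {
>             // "it's a new node": no local state exists, so the remote view
>             // is accepted wholesale; handleMajorStateChange() then calls
>             // markAlive() unless the state is a dead state
>             handleMajorStateChange(ep, remoteState);
>         }
>     }
> }
> {code}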
> Now, in each gossip round doStatusCheck() is called, which iterates over all
> nodes (including the one falsely Up) and, among other things, calls
> FailureDetector.interpret() on each node.
> FailureDetector.interpret() is meant to update its sense of Phi for the node,
> and potentially convict it. However, there is a short-circuit at the top,
> whereby if we do not yet have any arrival window for the node, we simply
> return immediately.
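> That short-circuit looks approximately like this (paraphrased from
> FailureDetector.interpret() in 0.8):
> {code:java}
> public void interpret(InetAddress ep)
> {
>     ArrivalWindow hbWnd = arrivalSamples.get(ep);
>     if (hbWnd == null)
>         return; // no arrival window yet: the node can never be convicted
>     double phi = hbWnd.phi(System.currentTimeMillis());
>     if (phi > getPhiConvictThreshold())
>     {
>         for (IFailureDetectionEventListener listener : fdEvntListeners)
>             listener.convict(ep);
>     }
> }
> {code}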
> Arrival intervals are only added as a result of a FailureDetector.report()
> call, which never happens in this case because the initial endpoint state we
> added, which came from a remote node that was up, had the latest version of
> the gossip state (so Gossiper.reportFailureDetector() will never call
> report()).
> The result is that the node can never ever be convicted.
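> In other words, the failure detector is only fed when gossip sees something
> newer than what is already stored locally; roughly (a paraphrase of the
> version check, not the exact code):
> {code:java}
> void reportFailureDetector(InetAddress ep, EndpointState remoteState)
> {
>     EndpointState localState = endpointStateMap.get(ep);
>     int localGen = localState.getHeartBeatState().getGeneration();
>     int remoteGen = remoteState.getHeartBeatState().getGeneration();
>     boolean newerVersion = remoteGen == localGen
>         && remoteState.getHeartBeatState().getHeartBeatVersion()
>            > localState.getHeartBeatState().getHeartBeatVersion();
>     // The state we stored for the Down node already carried the latest
>     // generation and version (it came from a live node's view), and the
>     // Down node produces no new heartbeats, so this never fires.
>     if (remoteGen > localGen || newerVersion)
>         FailureDetector.instance.report(ep);
> }
> {code}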
> Now, let's ignore for a moment the problem that a node that is actually Down
> will be thought to be Up temporarily for a little while. That is sub-optimal,
> but let's aim for a fix to the more serious problem in this ticket - which is
> that it stays Up forever.
> Considered solutions:
> * When interpret() gets called and there is no arrival window, we could add a
> faked arrival window far back in time to cause the node to have history and
> be marked down. This "works" in the particular test case. The problem is
> that, since we are not ourselves actively trying to gossip to these nodes
> with any particular speed, it might take a significant time before we get any
> kind of confirmation from someone else that the node is actually Up (in cases
> where it *is* Up), so it's not clear that this is a good idea.
> * When interpret() gets called and there is no arrival window, we can simply
> convict it immediately. This has roughly similar behavior as the previous
> suggestion.
> * When interpret() gets called and there is no arrival window, we can add a
> faked arrival window at the current time, which will allow it to be treated
> as Up until the usual time has passed before we exceed the Phi conviction
> threshold.
> * When interpret() gets called and there is no arrival window, we can
> immediately convict it, *and* schedule it for immediate gossip on the next
> round in order to try to ensure it goes back Up quickly if it is indeed up.
> This has an effect of O(n) gossip traffic, as a special case once during node
> start-up. While theoretically a problem, I personally think we can ignore it
> for now since it won't be a significant problem any time soon. However, this
> option is more complicated, since messages are queued asynchronously with
> respect to background connection attempts. We'd have to make sure the initial
> gossip message actually gets sent on an open TCP connection (I haven't
> confirmed whether this will be the case or not).
> The first three are simple to implement, and possibly the fourth as well.
> But in all cases, I am worried about potential negative consequences that I
> am not seeing. A sketch of the third option follows.
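> For concreteness, the third option might look like this inside interpret()
> (a sketch only, not a tested patch):
> {code:java}
> public void interpret(InetAddress ep)
> {
>     ArrivalWindow hbWnd = arrivalSamples.get(ep);
>     if (hbWnd == null)
>     {
>         // Fake an arrival at the current time: the node is then treated as
>         // Up only until phi exceeds the conviction threshold in the usual
>         // way, after which a genuinely Down node is finally convicted.
>         report(ep);
>         hbWnd = arrivalSamples.get(ep);
>     }
>     double phi = hbWnd.phi(System.currentTimeMillis());
>     if (phi > getPhiConvictThreshold())
>     {
>         for (IFailureDetectionEventListener listener : fdEvntListeners)
>             listener.convict(ep);
>     }
> }
> {code}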
> Thoughts?