[ https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170409#comment-13170409 ]
Peter Schuller commented on CASSANDRA-3626:
-------------------------------------------

+1. :)

Nodes can get stuck in UP state forever, despite being DOWN
-----------------------------------------------------------

                 Key: CASSANDRA-3626
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.8, 1.0.5
            Reporter: Peter Schuller
            Assignee: Peter Schuller
         Attachments: 3626.txt

This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done):

We have observed a problem with gossip whereby, when you bootstrap a new node (or replace one using the replace_token support), any node in the cluster that is Down at the time the new node starts will be assumed to be Up and then *never ever* flapped back to Down until you restart the new node. This has at least two implications for replacing or bootstrapping new nodes while there are nodes down in the ring:

* If the new node happens to select a node listed as Up (but in reality Down) as a stream source, streaming will hang forever.
* If that doesn't happen (because another host is picked), the new node will instead finish bootstrapping correctly and begin servicing requests, all the while thinking Down nodes are Up, and thus routing requests to them and generating timeouts.

The way to get out of this state is to restart the node(s) that you bootstrapped.

I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8, however, so all details below refer to 0.8 but are probably similar in 1.0.

Steps to reproduce:

* Bring up a cluster of >= 3 nodes. *Ensure RF < N*, so that the cluster stays operative with one node removed.
* Pick two random nodes A and B. Shut them *both* off.
* Wait for everyone to realize they are both off (for good measure).
* Now take node A, nuke its data directories, and restart it so that it comes up with a normal bootstrap (or use replace_token; I didn't test that, but it should not affect the outcome).
* Watch how node A starts up, all the while believing node B is Up, even though all other nodes in the cluster agree that B is Down and B is in fact still turned off.

The mechanism by which the node initially goes into the Up state is this: the bootstrapping node receives a gossip response from some other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() has no local endpoint state for the down node, so the else branch at the end ("it's a new node") is triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive() unless the state is a dead state (where "dead" does not mean "not up", but refers to joining/hibernate etc.).

So at this point the down node is Up in the mind of the node you just bootstrapped.

Now, in each gossip round doStatusCheck() is called, which iterates over all known nodes (including the falsely Up one) and, among other things, calls FailureDetector.interpret() on each of them. FailureDetector.interpret() is meant to update its estimate of phi for the node and potentially convict it. However, there is a short-circuit at the top: if we do not yet have an arrival window for the node, we simply return immediately (see the sketch below).
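To make that short-circuit concrete, here is a heavily simplified sketch of the relevant failure-detector logic. This is a paraphrase under assumptions, not the actual 0.8 source: the class name SketchFailureDetector, the stand-in phi computation, and some signatures are mine.

{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks heartbeat arrivals for one endpoint; the real class keeps a
// window of inter-arrival times, elided here.
class ArrivalWindow
{
    private long lastTimestampMillis;

    ArrivalWindow(long firstTimestampMillis)
    {
        lastTimestampMillis = firstTimestampMillis;
    }

    void add(long timestampMillis)
    {
        lastTimestampMillis = timestampMillis;
    }

    // phi grows the longer we go without a heartbeat.  The real
    // computation uses the inter-arrival distribution; this stand-in
    // just grows linearly with silence.
    double phi(long nowMillis)
    {
        return (nowMillis - lastTimestampMillis) / 1000.0;
    }
}

class SketchFailureDetector
{
    private static final double PHI_CONVICT_THRESHOLD = 8.0;

    private final Map<InetAddress, ArrivalWindow> arrivalSamples =
            new ConcurrentHashMap<InetAddress, ArrivalWindow>();

    // Called from Gossiper.doStatusCheck() once per gossip round.
    public void interpret(InetAddress ep)
    {
        ArrivalWindow window = arrivalSamples.get(ep);
        if (window == null)
            return; // <-- the short-circuit: no report() yet means no
                    // window, so the endpoint can never be convicted,
                    // no matter how long it stays silent
        if (window.phi(System.currentTimeMillis()) > PHI_CONVICT_THRESHOLD)
            convict(ep);
    }

    // Only report() creates an arrival window.
    public void report(InetAddress ep)
    {
        long now = System.currentTimeMillis();
        ArrivalWindow window = arrivalSamples.get(ep);
        if (window == null)
            arrivalSamples.put(ep, new ArrivalWindow(now));
        else
            window.add(now);
    }

    private void convict(InetAddress ep)
    {
        // notifies listeners (the Gossiper), which mark the endpoint Down
    }
}
{code}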
Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case: the initial endpoint state we applied came from a remote node that was up and already carried the latest version of the gossip state, so Gossiper.reportFailureDetector() will never call report().

The result is that the node can never, ever be convicted.

Now, let's ignore for a moment the lesser problem that a node which is actually Down will be thought to be Up temporarily for a little while. That is suboptimal, but let's aim to fix the more serious problem in this ticket: that it stays Up forever.

Considered solutions:

* When interpret() is called and there is no arrival window, we could add a faked arrival window far back in time, giving the node enough "history" to be marked Down. This "works" in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular urgency, it might take a significant time before we get confirmation from someone else that the node is actually Up in cases where it really *is* Up, so it's not clear that this is a good idea.
* When interpret() is called and there is no arrival window, we could simply convict the node immediately. This behaves roughly like the previous suggestion.
* When interpret() is called and there is no arrival window, we could add a faked arrival window at the current time, which allows the node to be treated as Up until the usual time has passed for phi to exceed the conviction threshold (a sketch of this option follows below).
* When interpret() is called and there is no arrival window, we could immediately convict the node *and* schedule it for immediate gossip on the next round, to try to ensure that nodes go Up quickly if they are indeed up. This causes O(n) gossip traffic, but only as a special case during node start-up; while theoretically a problem, I personally think we can ignore that for now since it won't be significant any time soon. However, this option is more complicated, because the way we queue up messages is asynchronous with respect to background connection attempts: we would have to make sure the initial gossip message actually gets sent on an open TCP connection (I haven't confirmed whether this is the case).

The first three are simple to implement, possibly also the fourth. But in all cases I am worried about potential negative consequences that I am not seeing.

Thoughts?
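For concreteness, here is a minimal sketch of the third option, reusing the names from the sketch above. This is my assumed shape of such a fix, not the attached 3626.txt patch.

{code:java}
public void interpret(InetAddress ep)
{
    ArrivalWindow window = arrivalSamples.get(ep);
    if (window == null)
    {
        // Instead of returning silently, seed a window at "now".  The
        // endpoint is then treated as Up only for the usual grace
        // period: if no heartbeat is ever reported, phi eventually
        // exceeds the threshold and the node is convicted like any
        // other silent node.
        arrivalSamples.put(ep, new ArrivalWindow(System.currentTimeMillis()));
        return;
    }
    if (window.phi(System.currentTimeMillis()) > PHI_CONVICT_THRESHOLD)
        convict(ep);
}
{code}

The appeal of this variant is that it degrades gracefully: a node that really is Up will have its seeded window refreshed by subsequent report() calls, while a node that is Down simply runs out its grace period and is convicted.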