[ https://issues.apache.org/jira/browse/CASSANDRA-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170409#comment-13170409 ]
Peter Schuller commented on CASSANDRA-3626:
-------------------------------------------

+1. :)

Nodes can get stuck in UP state forever, despite being DOWN
-----------------------------------------------------------

                 Key: CASSANDRA-3626
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.8, 1.0.5
            Reporter: Peter Schuller
            Assignee: Peter Schuller
         Attachments: 3626.txt

This is a proposed phrasing for an upstream ticket named "Newly discovered nodes that are down get stuck in UP state forever" (will edit w/ feedback until done):

We have observed a problem with gossip whereby, when you bootstrap a new node (or replace one using the replace_token support), any node in the cluster that is Down at the time the new node starts will be assumed to be Up and then *never ever* flapped back to Down until you restart the new node. This has at least two implications for replacing or bootstrapping new nodes while there are nodes down in the ring:

* If the new node happens to select a node listed as Up (but in reality Down) as a stream source, streaming will hang forever.
* If that doesn't happen (because another host is picked), the new node will instead finish bootstrapping correctly and begin servicing requests, all the while thinking Down nodes are Up, and thus routing requests to them and generating timeouts.

The way to get out of this state is to restart the node(s) that you bootstrapped.

I have tested and confirmed the symptom (that the bootstrapped node thinks other nodes are Up) using a fairly recent 1.0. The main debugging effort happened on 0.8, however, so all details below refer to 0.8 but are probably similar in 1.0.

Steps to reproduce:

* Bring up a cluster of >= 3 nodes. *Ensure RF < N*, so that the cluster stays operative with one node removed.
* Pick two random nodes A and B. Shut them *both* off.
* Wait for everyone to realize they are both off (for good measure).
* Now take node A, nuke its data directories, and restart it so that it comes up with a normal bootstrap (or use replace_token; I didn't test that, but it should not affect the outcome).
* Watch how node A starts up, all the while believing node B is Up, even though all other nodes in the cluster agree that B is Down and B is in fact still turned off.

The mechanism by which the node initially goes into the Up state is this: the bootstrapping node receives a gossip response from some other node in the cluster, and GossipDigestAck2VerbHandler.doVerb() calls Gossiper.applyStateLocally(). Gossiper.applyStateLocally() has no local endpoint state for the down node, so the else branch at the end ("it's a new node") is triggered and handleMajorStateChange() is called. handleMajorStateChange() always calls markAlive() unless the state is a dead state (where "dead" does not mean "not up", but refers to joining/hibernate etc.).

So at this point the down node is Up in the mind of the node you just bootstrapped.

Now, in each gossip round doStatusCheck() is called, which iterates over all known nodes (including the falsely Up one) and, among other things, calls FailureDetector.interpret() on each of them. FailureDetector.interpret() is meant to update its estimate of phi for the node and potentially convict it. However, there is a short-circuit at the top: if we do not yet have an arrival window for the node, we simply return immediately (see the sketch below).
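To make that short-circuit concrete, here is a heavily simplified sketch of the relevant failure-detector logic. This is a paraphrase under assumptions, not the actual 0.8 source: the class name SketchFailureDetector, the stand-in phi computation, and some signatures are mine.

{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks heartbeat arrivals for one endpoint; the real class keeps a
// window of inter-arrival times, elided here.
class ArrivalWindow
{
    private long lastTimestampMillis;

    ArrivalWindow(long firstTimestampMillis)
    {
        lastTimestampMillis = firstTimestampMillis;
    }

    void add(long timestampMillis)
    {
        lastTimestampMillis = timestampMillis;
    }

    // phi grows the longer we go without a heartbeat.  The real
    // computation uses the inter-arrival distribution; this stand-in
    // just grows linearly with silence.
    double phi(long nowMillis)
    {
        return (nowMillis - lastTimestampMillis) / 1000.0;
    }
}

class SketchFailureDetector
{
    private static final double PHI_CONVICT_THRESHOLD = 8.0;

    private final Map<InetAddress, ArrivalWindow> arrivalSamples =
            new ConcurrentHashMap<InetAddress, ArrivalWindow>();

    // Called from Gossiper.doStatusCheck() once per gossip round.
    public void interpret(InetAddress ep)
    {
        ArrivalWindow window = arrivalSamples.get(ep);
        if (window == null)
            return; // <-- the short-circuit: no report() yet means no
                    // window, so the endpoint can never be convicted,
                    // no matter how long it stays silent
        if (window.phi(System.currentTimeMillis()) > PHI_CONVICT_THRESHOLD)
            convict(ep);
    }

    // Only report() creates an arrival window.
    public void report(InetAddress ep)
    {
        long now = System.currentTimeMillis();
        ArrivalWindow window = arrivalSamples.get(ep);
        if (window == null)
            arrivalSamples.put(ep, new ArrivalWindow(now));
        else
            window.add(now);
    }

    private void convict(InetAddress ep)
    {
        // notifies listeners (the Gossiper), which mark the endpoint Down
    }
}
{code}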
Arrival intervals are only added as a result of a FailureDetector.report() call, which never happens in this case: the initial endpoint state we applied came from a remote node that was up and already carried the latest version of the gossip state, so Gossiper.reportFailureDetector() will never call report().

The result is that the node can never, ever be convicted.

Now, let's ignore for a moment the lesser problem that a node which is actually Down will be thought to be Up temporarily for a little while. That is suboptimal, but let's aim to fix the more serious problem in this ticket: that it stays Up forever.

Considered solutions:

* When interpret() is called and there is no arrival window, we could add a faked arrival window far back in time, giving the node enough "history" to be marked Down. This "works" in the particular test case. The problem is that since we are not ourselves actively trying to gossip to these nodes with any particular urgency, it might take a significant time before we get confirmation from someone else that the node is actually Up in cases where it really *is* Up, so it's not clear that this is a good idea.
* When interpret() is called and there is no arrival window, we could simply convict the node immediately. This behaves roughly like the previous suggestion.
* When interpret() is called and there is no arrival window, we could add a faked arrival window at the current time, which allows the node to be treated as Up until the usual time has passed for phi to exceed the conviction threshold (a sketch of this option follows below).
* When interpret() is called and there is no arrival window, we could immediately convict the node *and* schedule it for immediate gossip on the next round, to try to ensure that nodes go Up quickly if they are indeed up. This causes O(n) gossip traffic, but only as a special case during node start-up; while theoretically a problem, I personally think we can ignore that for now since it won't be significant any time soon. However, this option is more complicated, because the way we queue up messages is asynchronous with respect to background connection attempts: we would have to make sure the initial gossip message actually gets sent on an open TCP connection (I haven't confirmed whether this is the case).

The first three are simple to implement, possibly also the fourth. But in all cases I am worried about potential negative consequences that I am not seeing.

Thoughts?
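For concreteness, here is a minimal sketch of the third option, reusing the names from the sketch above. This is my assumed shape of such a fix, not the attached 3626.txt patch.

{code:java}
public void interpret(InetAddress ep)
{
    ArrivalWindow window = arrivalSamples.get(ep);
    if (window == null)
    {
        // Instead of returning silently, seed a window at "now".  The
        // endpoint is then treated as Up only for the usual grace
        // period: if no heartbeat is ever reported, phi eventually
        // exceeds the threshold and the node is convicted like any
        // other silent node.
        arrivalSamples.put(ep, new ArrivalWindow(System.currentTimeMillis()));
        return;
    }
    if (window.phi(System.currentTimeMillis()) > PHI_CONVICT_THRESHOLD)
        convict(ep);
}
{code}

The appeal of this variant is that it degrades gracefully: a node that really is Up will have its seeded window refreshed by subsequent report() calls, while a node that is Down simply runs out its grace period and is convicted.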