[
https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626395#comment-14626395
]
Jason Brown commented on CASSANDRA-9765:
----------------------------------------
SHUTDOWN is only set as the gossip status (and gossiped around) when there is a
clean shutdown of the new node (a/k/a the operator doesn't "kill -9" it, for
example). Else, the gossip status for a new node will be one of three states:
boot (if the node started to bootstrap as a new node in the cluster), hibernate
(if the node was replacing a previous node), and null (node started to gossip,
but died or was bounced before transitioning to either bootstrap or hibernate).
{{checkForEndpointCollision()}} is only invoked when we're not replacing a
node, so we don't need to worry about gossip state that for the scope of this
ticket (it's also covered by {{isDeadState()}}).
[~Stefania]'s analysis about the SHUTDOWN state is basically correct, but both
bootstrap and "null" gossip states should also cause the check in
{{checkForEndpointCollision}} to fail. For completeness, in addition to
checking the dead states and the "not a member in TMD" that we currently do in
{{isFatClient()}}, we should check the gossip status is not SHUTDOWN nor null
or bootstrap. Hence, I propose a new method on Gossiper like this:
{code}
public boolean isSafeForBootstrap(InetAddress endpoint)
{
EndpointState epState = endpointStateMap.get(endpoint);
// if there's no previous state, or the node was previously removed
from the cluster, so we're good
if (epState == null || isDeadState(epState))
return true;
String state = ""; // call new method getApplicationState(epState);
// if there was no previous status, we must have failed to start
bootstrapping (but did start gossiping - see flow in SS.initServer())
if (state.equals(""))
return false;
// if previous state is NORMAL, then the node was (is!) actively legit
in the cluster; do not attempt to bootstrap in it's place.
// (node replace has a different path - see SS.prepareToJoin())
if (state.equals(VersionedValue.STATUS_NORMAL))
return false;
// check for either a failed bootstrap (and operator 'kill -9' the
process)
// or a legit shutdown (subject to the previous checks in this method)
List<String> states = new ArrayList<String>() {{
add(VersionedValue.STATUS_BOOTSTRAPPING);
add(VersionedValue.SHUTDOWN); }};
return states.contains(state);
}
{code}
We can bikeshed the name of the method and the implementation (I hacked this up
quickly), but the functionality should be what we want. Note: I have not had a
chance to test this yet, so it might not be 100% correct yet. I'm not calling
{{isFatClient()}} as all it does is make the extra call to {{TMD.isMember()}},
and the only way a node can be a member (wrt to TMD) is if it's gossip status
is NORMAL. It's possible this method could just replace the existing
functionality of {{isFatClient()}}, and, while initial investigation looks
reasonable, I'd have to consider the semantics in {{Gossiper.doStatusCheck()}}.
Also, [~Stefania], nice work adding the dtest. It would be nice to be able to
catch the additional statuses and ensure proper state transistions, but this is
a good start.
> checkForEndpointCollision fails for legitimate collisions
> ---------------------------------------------------------
>
> Key: CASSANDRA-9765
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9765
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Richard Low
> Assignee: Stefania
> Fix For: 2.0.17
>
>
> Since CASSANDRA-7939, checkForEndpointCollision no longer catches a
> legitimate collision. Without CASSANDRA-7939, wiping a node and starting it
> again fails with 'A node with address %s already exists', but with it the
> node happily enters joining state, potentially streaming from the wrong place
> and violating consistency.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)