[ 
https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626395#comment-14626395
 ] 

Jason Brown commented on CASSANDRA-9765:
----------------------------------------

SHUTDOWN is only set as the gossip status (and gossiped around) when there is a 
clean shutdown of the new node (a/k/a the operator doesn't "kill -9" it, for 
example). Else, the gossip status for a new node will be one of three states: 
boot (if the node started to bootstrap as a new node in the cluster), hibernate 
(if the node was replacing a previous node), and null (node started to gossip, 
but died or was bounced before transitioning to either bootstrap or hibernate). 
{{checkForEndpointCollision()}} is only invoked when we're not replacing a 
node, so we don't need to worry about gossip state that for the scope of this 
ticket (it's also covered by {{isDeadState()}}).

[~Stefania]'s analysis about the SHUTDOWN state is basically correct, but both 
bootstrap and "null" gossip states should also cause the check in 
{{checkForEndpointCollision}} to fail. For completeness, in addition to 
checking the dead states and the "not a member in TMD" that we currently do in 
{{isFatClient()}}, we should check the gossip status is not SHUTDOWN nor null 
or bootstrap. Hence, I propose a new method on Gossiper like this:
{code}
    public boolean isSafeForBootstrap(InetAddress endpoint)
    {
        EndpointState epState = endpointStateMap.get(endpoint);
        // if there's no previous state, or the node was previously removed 
from the cluster, so we're good
        if (epState == null || isDeadState(epState))
            return true;

        String state = ""; // call new method getApplicationState(epState);
        // if there was no previous status, we must have failed to start 
bootstrapping (but did start gossiping - see flow in SS.initServer())
        if (state.equals(""))
            return false;
        // if previous state is NORMAL, then the node was (is!) actively legit 
in the cluster; do not attempt to bootstrap in it's place. 
        // (node replace has a different path - see SS.prepareToJoin())
        if (state.equals(VersionedValue.STATUS_NORMAL))
            return false;

        // check for either a failed bootstrap (and operator 'kill -9' the 
process)
        // or a legit shutdown (subject to the previous checks in this method)
        List<String> states = new ArrayList<String>() {{ 
add(VersionedValue.STATUS_BOOTSTRAPPING);
                                                          
add(VersionedValue.SHUTDOWN); }};
        return states.contains(state);
    }
{code}
We can bikeshed the name of the method and the implementation (I hacked this up 
quickly), but the functionality should be what we want. Note: I have not had a 
chance to test this yet, so it might not be 100% correct yet. I'm not calling 
{{isFatClient()}} as all it does is make the extra call to {{TMD.isMember()}}, 
and the only way a node can be a member (wrt to TMD) is if it's gossip status 
is NORMAL. It's possible this method could just replace the existing 
functionality of {{isFatClient()}}, and, while initial investigation looks 
reasonable, I'd have to consider the semantics in {{Gossiper.doStatusCheck()}}.

Also, [~Stefania], nice work adding the dtest. It would be nice to be able to 
catch the additional statuses and ensure proper state transistions, but this is 
a good start.

> checkForEndpointCollision fails for legitimate collisions
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9765
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Richard Low
>            Assignee: Stefania
>             Fix For: 2.0.17
>
>
> Since CASSANDRA-7939, checkForEndpointCollision no longer catches a 
> legitimate collision. Without CASSANDRA-7939, wiping a node and starting it 
> again fails with 'A node with address %s already exists', but with it the 
> node happily enters joining state, potentially streaming from the wrong place 
> and violating consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to