[
https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627739#comment-14627739
]
Stefania commented on CASSANDRA-9765:
-------------------------------------
Thank you for your in-dept analysis Jason. If I understood correctly, this
method should be called in {{checkForEndpointCollision}} as follows and replace
all other conditions:
{code}
if (!Gossiper.instance.isSafeForBootstrap(FBUtilities.getBroadcastAddress()))
{code}
I think it makes sense to look at the gossip states rather than
{{TMD.isMember()}}. I am still unclear however on the empty, BOOTSTRAP and
SHUTDOWN states. In your method the first is not allowed and the last two are
both allowed.
According to CASSANDRA-7939 BOOTSTRAP should be allowed but not SHUTDOWN?
If you clarify which states are OK to join the cluster I can rewrite the
method, here is what I have so far:
{code}
public boolean isSafeForBootstrap(InetAddress endpoint)
{
EndpointState epState = endpointStateMap.get(endpoint);
// if there's no previous state, or the node was previously removed
from the cluster, we're good
if (epState == null || isDeadState(epState))
return true;
String state = getApplicationState(epState);
// these states are not allowed to join the cluster
List<String> states = new ArrayList<String>() {{
add(""); // failed bootstrap but we did start gossiping <== is this
allowed or NOT? (CASSANDRA-7939)
add(VersionedValue.STATUS_NORMAL); // node is legit in the cluster
or it was stopped with kill -9 <== definitely not allowed
add(VersionedValue.STATUS_BOOTSTRAPPING); // failed bootstrap (e.g.
kill -9 during bootstrap) <== is this allowed or NOT? (CASSANDRA-7939)
add(VersionedValue.SHUTDOWN); }}; // node was shutdown but not
removed <== not allowed also, right?
return !states.contains(state);
}
{code}
As for the tests, I was able to add 3 more tests so right now we have the
following states:
* SHUTDOWN - node is shutdown gracefully, test name :
_shutdown_wiped_node_cannot_join_test_
* NORMAL - node is shutdown with kill -9, test name :
_killed_wiped_node_cannot_join_test_
* LEFT - node is decommissioned, test name :
_decommissioned_wiped_node_can_join_test_
* BOOT - node is shutdown with kill -9 during bootstrap, test name :
_failed_bootstap_wiped_node_can_join_test_
I think these are the essential tests, the only one missing is perhaps killing
the node after it starts gossiping and before it enters the bootstrap, but I am
not sure how easy it is as the python code must reliably kill the node at the
right time (with the bootstrap I was able to extend it by increasing the amount
of data streamed).
> checkForEndpointCollision fails for legitimate collisions
> ---------------------------------------------------------
>
> Key: CASSANDRA-9765
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9765
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Richard Low
> Assignee: Stefania
> Fix For: 2.0.17
>
>
> Since CASSANDRA-7939, checkForEndpointCollision no longer catches a
> legitimate collision. Without CASSANDRA-7939, wiping a node and starting it
> again fails with 'A node with address %s already exists', but with it the
> node happily enters joining state, potentially streaming from the wrong place
> and violating consistency.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)