[ 
https://issues.apache.org/jira/browse/CASSANDRA-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-10134:
----------------------------------------
    Status: Patch Available  (was: In Progress)

Sorry this has dragged on, I got bogged down with a few other things...

The linked branch modifies {{StorageService::prepareToJoin}} to always perform 
a collision check when not replacing and adds the ability to replace without 
bootstrap. To summarise the points from the comments above, and how they're 
addressed in the branch:
    * Seed nodes will perform a shadow round, which they're unable to exit when 
all seeds are started concurrently. The patch addresses this by modifying the 
response to a shadow digest syn. If a node receiving a shadow syn is itself in 
a shadow round, rather than ignoring it, it responds with a minimal ack with 
both the digest list and state map empty.This indicates to the node sending the 
syn that the seed is in its shadow round. The syn-sending node is permitted to 
exit its shadow round if it receives a regular ack (current behaviour) or if 
all seeds are found to be in a shadow round.
    * Whilst this is beneficial when all seeds are started concurrently, making 
that a mandatory requirement is, I feel, going to be overly burdensome for 
operators. In a cluster with > 1 seed, when all seeds are down the only options 
are to restart all seeds concurrently or to modify the seed list on one, 
restart that, then bring the others up before finally restoring the first 
seed's list and restarting it again. To fix this, the patch uses [~Stefania]'s 
approach and ultimately makes the shadow round a best effort. Note that this is 
for nodes which appear in their own seed list only, any node which doesn't 
consider itself a seed with fail to startup if it cannot complete a shadow 
round. I've also added a property, {{cassandra.allow_unsafe_join}} to skip the 
shadow round completely for use in testing. 
    * The collision check itself needs to be extended, as it's no longer 
performed only for bootstrapping nodes. When a node is not bootstrapping, we 
need to verify that its address is not already associated with another host id. 
When the node is bootstrapping, behaviour remains the same as before. That is, 
any previous status for the endpoint is retrieved from the shadow round & 
tested against a blacklist of disallowed previous states. 
    * As [~pauloricardomg] noted, a side effect is to disallow the unsafe, 
ghetto-replace approach to dealing with scenarios such as a JBOD disk failure. 
With that in mind patch decouples bootstrap and replacement, so that the 
combination of {{replace_address(_first_boot)}} and {{auto_bootstrap=false}} is 
permitted. As this is a "genuine (but unsafe)" scenario, I've added startup 
flag to ensure that operators are cognizant of the risk involved in doing so. A 
benefit of this over the documented approach of manually setting initial tokens 
in yaml is that the replacement tokens are retrieved from gossip, reducing the 
chance for operator error. Performing this non-bootstrapping replace requires 
{{-Dcassandra.allow_unsafe_replace=true}} at startup.

The best-effort-only approach to the shadow round for seed nodes clearly 
undermines the safety benefits of all this somewhat, raising the question of 
whether it's actually worth making the modifications to Syn & Ack handling. In 
my opinion, those changes are fairly minimal & don't make things significantly 
harder to reason about, so I'm ok with including those but I won't argue too 
strongly if dropping them is suggested. 

[One of the utest 
failures|http://cassci.datastax.com/job/beobal-10134-trunk-testall/2/testReport/org.apache.cassandra.service/RemoveTest/testLocalHostId_compression/]
 looks particularly suspect. My suspicion is that something is already racy 
with the test setup as I can find the exact same failure in recent-ish [another 
run|http://cassci.datastax.com/job/trunk_testall/806/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/].
 So far I've been unable to figure out exactly what the problem is by 
inspection & I haven't been able to repro the failure in > 100 test runs 
locally. 

There were also some dtest failures on the earlier runs, caused by a bug which 
I've since fixed & verified locally (CI is pending).

I've pushed a dtest branch containing [~thobbs]' original new test, plus 
another for the unsafe replace operation 
[here|https://github.com/beobal/cassandra-dtest/tree/10134]. A follow up will 
be to identify which tests are adversely affected by the mandatory shadow round 
(I know that {{ttl_test}} definitely is), then have them skip it if appropriate.

||branch||testall||dtest||
|[10134-trunk|https://github.com/beobal/cassandra/tree/10134-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-10134-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/beobal/job/beobal-10134-trunk-dtest]|


> Always require replace_address to replace existing address
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-10134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10134
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Distributed Metadata
>            Reporter: Tyler Hobbs
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.x
>
>
> Normally, when a node is started from a clean state with the same address as 
> an existing down node, it will fail to start with an error like this:
> {noformat}
> ERROR [main] 2015-08-19 15:07:51,577 CassandraDaemon.java:554 - Exception 
> encountered during startup
> java.lang.RuntimeException: A node with address /127.0.0.3 already exists, 
> cancelling join. Use cassandra.replace_address if you want to replace this 
> node.
>       at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543)
>  ~[main/:na]
>       at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:783)
>  ~[main/:na]
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:720)
>  ~[main/:na]
>       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:611)
>  ~[main/:na]
>       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) 
> [main/:na]
>       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:537)
>  [main/:na]
>       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:626) 
> [main/:na]
> {noformat}
> However, if {{auto_bootstrap}} is set to false or the node is in its own seed 
> list, it will not throw this error and will start normally.  The new node 
> then takes over the host ID of the old node (even if the tokens are 
> different), and the only message you will see is a warning in the other 
> nodes' logs:
> {noformat}
> logger.warn("Changing {}'s host ID from {} to {}", endpoint, storedId, 
> hostId);
> {noformat}
> This could cause an operator to accidentally wipe out the token information 
> for a down node without replacing it.  To fix this, we should check for an 
> endpoint collision even if {{auto_bootstrap}} is false or the node is a seed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to