[ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036251#comment-14036251 ]
Brandon Williams commented on CASSANDRA-7307:
---------------------------------------------
bq. I don't think there's going to be a happy medium where we avoid both false
positives and false negatives
I agree, but I'd rather err on the side of the more common case, and let the
more exotic cases override the initial value if they need to. 5s still puts us
outside the range of the replace_address check, which makes sure the node being
replaced is dead before it even begins streaming (though you'd possibly have
streaming problems during bootstrap in this scenario as well), and retry looping
would result in an endless loop for the very scenario that check is designed to
catch. 3s puts us just inside that range.
> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
> Key: CASSANDRA-7307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
> Project: Cassandra
> Issue Type: Bug
> Reporter: Richard Low
> Assignee: Brandon Williams
> Fix For: 1.2.17
>
>
> When doing a node replacement while other nodes are down, we see the down nodes
> marked as up for about 10 minutes. This means requests are routed to the dead
> nodes, causing timeouts. It also means replacing a node when multiple nodes
> from a replica set are down is extremely difficult - the node usually tries to
> stream from a dead node and the replacement fails.
> This isn't limited to host replacement. I did a simple test:
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I
> don't think this is significant)
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
> {code}
> {code}
> I reproduced on 1.2.15 and 1.2.16.
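
To put a number on the "almost 10 minutes" in the log quoted above, here is a
minimal sketch that computes the gap between the UP and DOWN events for
/127.0.0.2. The two timestamps are copied from the Gossiper lines in the report;
the helper name elapsed_seconds and the constant names are made up for this
illustration and are not part of Cassandra.

```python
from datetime import datetime

# Timestamps copied from the Gossiper lines in the quoted log.
UP_AT = "2014-05-27 14:28:31,495"    # "InetAddress /127.0.0.2 is now UP"
DOWN_AT = "2014-05-27 14:37:44,526"  # "InetAddress /127.0.0.2 is now DOWN"

# Log timestamp layout: date, time, then a comma and milliseconds.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TS_FORMAT)
    ended = datetime.strptime(end, LOG_TS_FORMAT)
    return (ended - started).total_seconds()


stale_for = elapsed_seconds(UP_AT, DOWN_AT)
minutes, seconds = divmod(stale_for, 60)
print(f"{int(minutes)} min {seconds:.3f} s")  # prints: 9 min 13.031 s
```

In this run the dead node was reported as up for a little over 9 minutes before
the new node marked it down.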
--
This message was sent by Atlassian JIRA
(v6.2#6252)