[jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes

Richard Low (JIRA) Thu, 05 Jun 2014 01:26:23 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018575#comment-14018575 ]


Richard Low commented on CASSANDRA-7307:
----------------------------------------

I don't know the failure detector (FD) params, so I can't comment on that, but 
my test case reproduces it when growing a 2-node cluster to 3 nodes. We also 
saw it in prod in a larger cluster during node replacement, so it shows up in 
both small and large clusters.

bq. but isn't adding capacity when things are up, more common than when things 
are down?

True, but replacing nodes while other nodes are down is common. That's why we 
really care about it: currently, if 2 nodes from a replica set are down, you 
can't use -Dcassandra.replace_address because the replacement node tries to 
stream from the other dead node.

bq. What do you get for 10s / 5s, Richard Low

I will test locally, but it is harder to test on a large cluster. Brandon 
suggested 1 second when doing replacement.

NB: on 1.1 we had no problems with this, so the initial conditions used to be 
better, at least when doing replacement.

> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
>                 Key: CASSANDRA-7307
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Richard Low
>            Assignee: Brandon Williams
>             Fix For: 1.2.17
>
>
> When replacing a node while other nodes are down, we see the down nodes 
> marked as up for about 10 minutes. This means requests are routed to the dead 
> nodes, causing timeouts. It also means replacing a node when multiple nodes 
> from a replica set are down is extremely difficult: the node usually tries to 
> stream from a dead node and the replacement fails.
> This isn't limited to host replacement. I did a simple test:
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I 
> don't think this is significant)
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging 
> initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node 
> /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) 
> InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) 
> InetAddress /127.0.0.2 is now DOWN
> {code}
> I reproduced on 1.2.15 and 1.2.16.
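
For reference, the "almost 10 minutes" window can be read directly off the two 
Gossiper log lines quoted above. The following is a minimal sketch in Python, 
not part of the original report; the helper name elapsed_seconds and the format 
string are assumptions made for illustration, and only the two timestamps come 
from the log excerpt.

```python
from datetime import datetime

# Assumed format for the timestamps in the quoted log lines:
# date, time, then comma-separated milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps copied from the "is now UP" and "is now DOWN" lines for /127.0.0.2.
marked_up = "2014-05-27 14:28:31,495"
marked_down = "2014-05-27 14:37:44,526"

window = elapsed_seconds(marked_up, marked_down)
print(f"dead node reported as UP for {window:.3f} s ({window / 60:.1f} min)")
```

With the timestamps above, the window comes to 553.031 seconds, a little over 
9 minutes, which matches the description.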



--
This message was sent by Atlassian JIRA
(v6.2#6252)
