[jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes

Richard Low (JIRA) Wed, 18 Jun 2014 13:31:27 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036328#comment-14036328
 ]


Richard Low commented on CASSANDRA-7307:
----------------------------------------

bq. For bootstrap? Let me be clear, the problem with replace is not related to 
streaming. It's refusing to replace a live node, because the FD takes so long 
to report it as down upon first discovery.

Actually, most of the time the problem is streaming. The replacement itself 
goes ahead (which surprises me, since the dead node is clearly listed as UP), 
but the new node then requests a stream from the dead node, and that request 
fails. We've seen it stream happily from other nodes and then ultimately fail 
because the stream from the dead node fails.

However, we also see a problem where it fails to replace because it thinks the 
node is live. This happens less often, but I expect it has the same root cause.


> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
>                 Key: CASSANDRA-7307
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Richard Low
>            Assignee: Brandon Williams
>             Fix For: 1.2.17, 2.0.9, 2.1 rc2
>
>
> When replacing a node while other nodes are down, we see the down nodes 
> marked as up for about 10 minutes. This means requests are routed to the dead 
> nodes, causing timeouts. It also means replacing a node when multiple nodes 
> from a replica set are down is extremely difficult: the node usually tries to 
> stream from a dead node and the replacement fails.
> This isn't limited to host replacement. I did a simple test:
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I 
> don't think this is significant)
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
> {code}
> {code}
> I reproduced on 1.2.15 and 1.2.16.
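
For reference, the "almost 10 minutes" figure can be checked against the two Gossiper timestamps in the log excerpt above. This is only a small sketch: the timestamp strings are copied from the UP and DOWN lines, while the variable names and the format string are illustrative choices, not anything taken from Cassandra itself.

```python
from datetime import datetime

# Timestamp layout used in the quoted log lines (milliseconds after a comma).
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Copied from the "is now UP" and "is now DOWN" lines for /127.0.0.2.
marked_up = datetime.strptime("2014-05-27 14:28:31,495", LOG_FORMAT)
marked_down = datetime.strptime("2014-05-27 14:37:44,526", LOG_FORMAT)

# How long the new node considered the dead node to be alive.
elapsed = marked_down - marked_up
minutes, seconds = divmod(elapsed.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.3f} s")  # 9 min 13.031 s
```

So in this run the dead node stayed marked as UP for a little over 9 minutes after the new node first learned about it.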



--
This message was sent by Atlassian JIRA
(v6.2#6252)
