[ 
https://issues.apache.org/jira/browse/CASSANDRA-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781814#comment-15781814
 ] 

Jay Zhuang commented on CASSANDRA-12172:
----------------------------------------

We saw a similar issue, but only when bootstrapping was interrupted. For a 
brand new node, bootstrap works fine. But if the bootstrap is interrupted for 
any reason and the node is restarted, we see this error: {{A node required to 
move the data consistently is down (/IP)}}. It can be reproduced by killing a 
{{UJ}} node and restarting it.

For us, the workaround is either deleting the data (and then bootstrapping 
again) or increasing 
[{{ring_delay_ms}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L122].
 The larger the cluster, the longer {{ring_delay_ms}} needs to be. Based on 
our tests, a 40-node cluster requires {{ring_delay_ms}} to be more than 50 
seconds, and a 70-node cluster more than 100 seconds. The default is 30 
seconds.
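
As a rough illustration of the second workaround, the delay can be raised at 
startup with the {{cassandra.ring_delay_ms}} system property, for example 
{{-Dcassandra.ring_delay_ms=60000}}. The sketch below only mimics how such a 
property is resolved against the 30-second default; the class and method names 
are made up for this example and are not the actual {{StorageService}} code.

```java
// Minimal sketch, not the Cassandra source: resolve the ring delay from the
// cassandra.ring_delay_ms system property, falling back to the 30 s default.
final class RingDelaySketch {
    // Default quoted in this comment: 30 seconds.
    static final int DEFAULT_RING_DELAY_MS = 30 * 1000;

    // Hypothetical helper name, chosen for this example only.
    static int resolveRingDelayMs() {
        String override = System.getProperty("cassandra.ring_delay_ms");
        if (override == null) {
            return DEFAULT_RING_DELAY_MS;
        }
        // Throws NumberFormatException if the value is not an integer.
        return Integer.parseInt(override.trim());
    }
}
```

With no property set, {{resolveRingDelayMs()}} returns the default; starting 
the JVM with {{-Dcassandra.ring_delay_ms=60000}} makes it return 60000.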

My guess is that the problem occurs because in 
[{{addSavedEndpoint}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1396],
 the initial state of each endpoint is marked as 
[{{dead}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1416],
 so on a large cluster it takes time to mark all nodes as live again. This is 
especially so when the messaging service version is 
[{{VERSION_20}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L961]
 or later, which uses 
[{{sendRR()}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L984]
 to check each endpoint before marking it live.

A simple fix would be to set {{ring_delay_ms}} based on the number of nodes.
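
A minimal sketch of what such scaling could look like, assuming a straight 
line through the two data points from our tests (40 nodes at 50 seconds, 70 
nodes at 100 seconds) and never going below the 30-second default. The linear 
formula and the names {{RingDelayEstimate}} and {{suggestedRingDelayMs}} are 
assumptions for illustration, not a measured model or an existing Cassandra 
API.

```java
// Hypothetical helper: estimate ring_delay_ms from the cluster size.
final class RingDelayEstimate {
    // Default quoted in this comment: 30 seconds.
    static final long DEFAULT_RING_DELAY_MS = 30_000L;

    // Assumption: linear interpolation through (40 nodes, 50 s) and
    // (70 nodes, 100 s), clamped so it never drops below the default.
    static long suggestedRingDelayMs(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be >= 1");
        }
        long interpolated = 50_000L + (nodeCount - 40) * 50_000L / 30;
        return Math.max(DEFAULT_RING_DELAY_MS, interpolated);
    }
}
```

Since the tested values were lower bounds ("more than" 50 and 100 seconds), 
any real implementation would likely want extra headroom on top of this 
estimate.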

> Fail to bootstrap new node.
> ---------------------------
>
>                 Key: CASSANDRA-12172
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12172
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Dikang Gu
>
> When I try to bootstrap a new node in the cluster, it sometimes fails with 
> the following exception.
> {code}
> 2016-07-12_05:14:55.58509 INFO  05:14:55 [main]: JOINING: Starting to 
> bootstrap...
> 2016-07-12_05:14:56.07491 INFO  05:14:56 [GossipTasks:1]: InetAddress 
> /2401:db00:2011:50c7:face:0:9:0 is now DOWN
> 2016-07-12_05:14:56.32219 Exception (java.lang.RuntimeException) encountered 
> during startup: A node required to move the data consistently is down 
> (/2401:db00:2011:50c7:face:0:9:0). If you wish to move the data from a 
> potentially inconsistent replica, restart the node with 
> -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32582 ERROR 05:14:56 [main]: Exception encountered during 
> startup
> 2016-07-12_05:14:56.32583 java.lang.RuntimeException: A node required to move 
> the data consistently is down (/2401:db00:2011:50c7:face:0:9:0). If you wish 
> to move the data from a potentially inconsistent replica, restart the node 
> with -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32584       at 
> org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:264)
>  ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at 
> org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:147) 
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at 
> org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:82) 
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at 
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1230)
>  ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:924)
>  ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:709)
>  ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:585)
>  ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) 
> [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586       at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516)
>  [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586       at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) 
> [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32730 WARN  05:14:56 [StorageServiceShutdownHook]: No 
> local state or state is in silent shutdown, not announcing shutdown
> {code}
> Here are more logs: 
> https://gist.github.com/DikangGu/c6a83eafdbc091250eade4a3bddcc40b
> I'm pretty sure there are no DOWN or restarted nodes in the cluster, but I 
> still see a lot of nodes going UP and DOWN in the gossip log, which causes 
> the bootstrap to fail in the end. Is this a known bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
