[
https://issues.apache.org/jira/browse/CASSANDRA-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781814#comment-15781814
]
Jay Zhuang commented on CASSANDRA-12172:
----------------------------------------
We saw a similar issue, but only when bootstrapping is interrupted. For a
brand-new node, bootstrap works fine. But if it is interrupted for any reason
and the node is restarted, we see this error: {{A node required to move the
data consistently is down (/IP)}}. It can be reproduced by killing a {{UJ}}
node and restarting it.
For us, the workaround is either deleting the data (then bootstrapping again)
or increasing
[{{ring_delay_ms}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L122].
The larger the cluster, the longer {{ring_delay_ms}} needs to be. Based on our
tests, a 40-node cluster requires {{ring_delay_ms}} to be >50 seconds, and a
70-node cluster >100 seconds. The default is 30 seconds.
I guess the problem may be that in
[{{addSavedEndpoint}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1396],
the initial status of each saved endpoint is marked as
[{{dead}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1416],
so it takes time for a large cluster to mark all nodes as live. This is
especially so when the messagingService version is later than
[{{VERSION_20}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L961],
which uses
[{{sendRR()}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L984]
to check each node before marking it alive.
A simple fix would be to set {{ring_delay_ms}} based on the number of nodes.
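As a rough sketch of that idea (not a patch, and not existing Cassandra code):
the class name {{RingDelaySketch}}, the method {{ringDelayMs}} and the 1.5
seconds-per-node allowance below are all assumptions. The allowance was picked
only so that the result clears the thresholds we observed (>50 seconds at 40
nodes, >100 seconds at 70 nodes) while never dropping below the 30 second
default.

```java
public final class RingDelaySketch {
    // Default ring delay mentioned above (30 seconds).
    static final long DEFAULT_RING_DELAY_MS = 30_000L;

    // Hypothetical per-node allowance. This is an assumption chosen to fit the
    // two data points from our tests, not a value taken from Cassandra.
    static final long ASSUMED_MS_PER_NODE = 1_500L;

    // Returns a ring delay that grows linearly with cluster size and never
    // falls below the default.
    public static long ringDelayMs(int nodeCount) {
        if (nodeCount < 0) {
            throw new IllegalArgumentException("nodeCount must be >= 0");
        }
        return Math.max(DEFAULT_RING_DELAY_MS, nodeCount * ASSUMED_MS_PER_NODE);
    }

    public static void main(String[] args) {
        for (int nodes : new int[] {10, 40, 70}) {
            System.out.println(nodes + " nodes -> " + ringDelayMs(nodes) + " ms");
        }
    }
}
```

With these assumed constants, small clusters keep the default, a 40-node
cluster gets 60 seconds and a 70-node cluster gets 105 seconds. As far as I
know the delay is read from the {{cassandra.ring_delay_ms}} system property,
so a computed value could be passed to the node at startup that way. A linear
rule is only a guess from two measurements; the real relationship may differ.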
> Fail to bootstrap new node.
> ---------------------------
>
> Key: CASSANDRA-12172
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12172
> Project: Cassandra
> Issue Type: Bug
> Reporter: Dikang Gu
>
> When I try to bootstrap a new node in the cluster, it sometimes fails with
> the following exception.
> {code}
> 2016-07-12_05:14:55.58509 INFO 05:14:55 [main]: JOINING: Starting to
> bootstrap...
> 2016-07-12_05:14:56.07491 INFO 05:14:56 [GossipTasks:1]: InetAddress
> /2401:db00:2011:50c7:face:0:9:0 is now DOWN
> 2016-07-12_05:14:56.32219 Exception (java.lang.RuntimeException) encountered
> during startup: A node required to move the data consistently is down
> (/2401:db00:2011:50c7:face:0:9:0). If you wish to move the data from a
> potentially inconsistent replica, restart the node with
> -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32582 ERROR 05:14:56 [main]: Exception encountered during
> startup
> 2016-07-12_05:14:56.32583 java.lang.RuntimeException: A node required to move
> the data consistently is down (/2401:db00:2011:50c7:face:0:9:0). If you wish
> to move the data from a potentially inconsistent replica, restart the node
> with -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32584 at
> org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:264)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584 at
> org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:147)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584 at
> org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:82)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584 at
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1230)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584 at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:924)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585 at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:709)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585 at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:585)
> ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585 at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300)
> [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586 at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516)
> [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586 at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625)
> [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32730 WARN 05:14:56 [StorageServiceShutdownHook]: No
> local state or state is in silent shutdown, not announcing shutdown
> {code}
> Here are more logs:
> https://gist.github.com/DikangGu/c6a83eafdbc091250eade4a3bddcc40b
> I'm pretty sure there are no DOWN nodes or restarted nodes in the cluster,
> but I still see a lot of nodes going UP and DOWN in the gossip log, which
> eventually causes the bootstrap to fail. Is this a known bug?