[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

Cameron Zemek (Jira) Wed, 20 Sep 2023 07:44:33 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767007#comment-17767007
 ]


Cameron Zemek commented on CASSANDRA-18845:
-------------------------------------------

[^stream.log] Without this patch, nodes get stuck and are unable to join a 
large test cluster:
{noformat}
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: INFO  o.a.cassandra.service.StorageService JOINING: Starting to bootstrap...
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: Exception (java.lang.RuntimeException) encountered during startup: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: java.lang.RuntimeException: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]:         at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294)
{noformat}
The node is stuck in an endless restart cycle (our service keeps retrying), 
and the error reports a different IP each time.

> Waiting for gossip to settle on live endpoints
> ----------------------------------------------
>
>                 Key: CASSANDRA-18845
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: delay.log, example.log, 
> image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow-up to CASSANDRA-18543.
> Although that ticket added the ability to set 
> cassandra.gossip_settle_min_wait_ms, tuning it is tedious and error prone. On 
> one node I just observed a 79-second gap between waiting for gossip to settle 
> and the first echo response indicating that a node is UP.
> The problem is that we do not want to start Native Transport until gossip 
> settles; otherwise queries can fail at consistency levels such as 
> LOCAL_QUORUM, because the node thinks the replicas are still in the DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I propose that (outside 
> a single-node cluster) the node wait for an UP message from another node 
> before considering gossip settled, e.g.:
> {code:java}
>             if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
>             {
>                 logger.debug("Gossip looks settled.");
>                 numOkay++;
>             } {code}
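
To illustrate how the extra liveSize > 1 condition changes the settle decision, here is a minimal, self-contained sketch. It is not the Cassandra implementation: the class GossipSettleSketch, the Snapshot type, the requireLivePeer flag, and the value of SUCCESSES_REQUIRED are all assumptions made for illustration, and the polling is simulated from a list instead of sleeping between reads of the gossip state.

```java
import java.util.List;

public class GossipSettleSketch {
    /** One simulated poll: endpoints known to gossip and endpoints marked live (including this node). */
    static final class Snapshot {
        final int endpoints;
        final int live;
        Snapshot(int endpoints, int live) { this.endpoints = endpoints; this.live = live; }
    }

    // Assumed number of consecutive stable polls needed; hypothetical value for this sketch.
    static final int SUCCESSES_REQUIRED = 3;

    /**
     * Returns the number of polls taken until gossip counts as settled, or -1 if it never does.
     * With requireLivePeer == true, a stable poll only counts when more than one endpoint is live,
     * mirroring the proposed "liveSize > 1" condition. Passing false models the check without it.
     */
    static int pollsUntilSettled(Snapshot initial, List<Snapshot> polls, boolean requireLivePeer) {
        int epSize = initial.endpoints;
        int liveSize = initial.live;
        int numOkay = 0;
        int totalPolls = 0;
        for (Snapshot current : polls) {
            totalPolls++;
            boolean stable = current.endpoints == epSize && current.live == liveSize;
            boolean peerSeen = !requireLivePeer || current.live > 1;
            if (stable && peerSeen)
                numOkay++;
            else
                numOkay = 0; // any change, or no live peer yet, restarts the count
            epSize = current.endpoints;
            liveSize = current.live;
            if (numOkay == SUCCESSES_REQUIRED)
                return totalPolls;
        }
        return -1;
    }

    public static void main(String[] args) {
        Snapshot initial = new Snapshot(10, 1);
        // Three quiet polls where only this node is live, then peers are marked UP.
        List<Snapshot> polls = List.of(
            new Snapshot(10, 1), new Snapshot(10, 1), new Snapshot(10, 1),
            new Snapshot(10, 5), new Snapshot(10, 5), new Snapshot(10, 5), new Snapshot(10, 5));
        System.out.println(pollsUntilSettled(initial, polls, false)); // 3
        System.out.println(pollsUntilSettled(initial, polls, true));  // 7
    }
}
```

In this simulation the check without the extra condition reports "settled" after three polls even though no peer has been marked UP, while the check with it waits until the live count has both risen above one and stopped changing.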



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
