[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836370#comment-17836370 ]

Cameron Zemek commented on CASSANDRA-18845:
---

I have reworked the patch further so that it adds a new method instead of modifying the existing waitToSettle, which keeps the change to existing behavior to a minimum. The new method is called directly in MigrationCoordinator::awaitSchemaRequests to handle a bootstrapping node (since nodes need to be in the UP state in order to get the schema and stream sstables from them), and again just before enabling native transport.

https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms
> this is tedious and error prone. On a node just observed a 79 second gap
> between waiting for gossip and the first echo response to indicate a node is
> UP.
> The problem being that do not want to start Native Transport until gossip
> settles otherwise queries can fail consistency such as LOCAL_QUORUM as it
> thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms I am proposing that
> (outside single node cluster) wait for UP message from another node before
> considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772415#comment-17772415 ]

Cameron Zemek commented on CASSANDRA-18845:
---

I have reworked the patch into a pull request: [Wait for live endpoints as part of waiting for gossip to settle by grom358 · Pull Request #2778 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2778]. I created the PR against 4.1 since 5.x is not as stable. I still have not got around to writing an automated test for this.

It has the following behaviors:
* Must opt in by setting cassandra.gossip_settle_wait_live_max
* Waits up to the maximum number of polls defined by cassandra.gossip_settle_wait_live_max. Set to -1 to wait indefinitely.
* cassandra.skip_wait_for_gossip_to_settle still applies to cap the maximum number of polls.
* cassandra.gossip_settle_wait_live_required determines how many polls in a row without a change to live endpoint state are needed to consider gossip as settled, once opted in via cassandra.gossip_settle_wait_live_max
* If the live endpoint size equals the number of endpoints, live endpoints are considered settled.
* Requires at least 1 other live endpoint before it begins considering live endpoints as settled.

Scenarios considered:
* One node cluster. Will skip this check since epSize == liveSize
* Entire cluster is down and a node is starting up. Will wait cassandra.gossip_settle_wait_live_max polls
* Restarting a node when another node is down. Will wait cassandra.gossip_settle_wait_live_required polls
* On rare occasions it takes a while to see another node as UP. This is covered by requiring at least 1 other endpoint to be up (`liveSize > 1`) to start the settlement process.

Being opt-in, this doesn't break any existing tests. This is also easier to use than the reverted patch, as you just need to set cassandra.gossip_settle_wait_live_max.

To restate, the purpose of this patch is to stop Native-Transport-Requests from starting before Cassandra has finished its ECHO requests to other nodes. Starting early results in requests failing LOCAL_QUORUM/QUORUM consistency, as the endpoints are not yet considered live for the purposes of executing requests. This comes up every time we do rolling restarts of large clusters for security patches and other such operations, where we typically only allow a single node to be down at a time. With this pull request, the wait for live endpoints ends once all endpoints are UP, which minimizes the time needed to perform rolling restarts while avoiding failed queries and impact on clients.
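The behaviors listed in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class and method names are hypothetical, the poll observations are passed in as an array instead of being read from the Gossiper, and `maxPolls` and `required` stand in for cassandra.gossip_settle_wait_live_max and cassandra.gossip_settle_wait_live_required.

```java
// Hypothetical sketch of the opt-in "wait for live endpoints" behavior.
// Not taken from the patch; names and structure are assumptions.
final class LiveEndpointSettleSketch {

    /**
     * Returns how many polls are consumed before startup would proceed.
     * observedLive[i] is the live endpoint count (including this node)
     * seen on poll i + 1; epSize is the total number of known endpoints.
     */
    static int pollsUntilProceed(int epSize, int[] observedLive, int maxPolls, int required) {
        int numOkay = 0;
        int previousLive = -1;
        for (int i = 0; i < observedLive.length; i++) {
            int polls = i + 1;
            int liveSize = observedLive[i];

            // Every known endpoint is live (this also covers a one node cluster).
            if (liveSize == epSize)
                return polls;

            // Count consecutive unchanged polls, but only once another node is UP.
            if (liveSize > 1 && liveSize == previousLive)
                numOkay++;
            else
                numOkay = 0;
            previousLive = liveSize;

            if (numOkay >= required)
                return polls;

            // A negative maxPolls means "wait indefinitely".
            if (maxPolls >= 0 && polls >= maxPolls)
                return polls;
        }
        // Ran out of simulated observations.
        return observedLive.length;
    }

    public static void main(String[] args) {
        // Entire cluster down: this node only ever sees itself, so it gives up at maxPolls.
        System.out.println(pollsUntilProceed(3, new int[] {1, 1, 1, 1, 1, 1}, 4, 2));
        // One other node comes UP and the live count then stays unchanged.
        System.out.println(pollsUntilProceed(3, new int[] {1, 2, 2, 2, 2}, 10, 3));
    }
}
```

In this sketch a healthy cluster exits as soon as the live count reaches the endpoint count, which matches the stated goal of not adding to startup time when all nodes are UP.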
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767418#comment-17767418 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

The "separate patch" makes sense to me. I like the fact that we are not changing what was there; we just add on top of it, so the original logic is untouched. It would keep things as they were, but you would also be covered if you have special requirements, e.g. you are waiting for all nodes to be marked as live so you have some level of certainty that CQL requests will not fail afterwards.

I am still missing a comprehensive test, e.g. an in-jvm dtest. I could probably help you with that, but I can already imagine that nailing down this scenario precisely and consistently, so that it is repeatable, might be a little bit challenging.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767361#comment-17767361 ]

Cameron Zemek commented on CASSANDRA-18845:
---

{noformat}
Sep 21 03:01:42 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 21 03:01:48 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:49 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:50 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO o.a.c.gms.GossipDigestAckVerbHandler Received a GossipDigestAckMessage from /15.223.140.86
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /44.229.153.229
...
Sep 21 03:03:40 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper InetAddress /44.229.153.229 is now UP
{noformat}

Got a test run with 18 second delay.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767358#comment-17767358 ]

Cameron Zemek commented on CASSANDRA-18845:
---

{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80
{noformat}

I am struggling to reproduce this ^ I have seen it twice, and after enabling more logging I haven't been able to reproduce it again. What I do sometimes see, though, is it taking over 30 seconds to get the first ECHO response.

Since there are dtests that rely on having CQL up while nodes are down, I have attached a patch [^18845-seperate.patch] (against the 5.0 branch) that is opt-in. Having settle just check for currentLive == liveSize still allows NTR to start while nodes are marked down. Yes, you can increase cassandra.gossip_settle_poll_success_required (and/or the other properties) to mitigate it, but these increase the minimum startup time, whereas [^18845-seperate.patch] doesn't add to this when the cluster is healthy.

A more elaborate solution would be to specify the required consistency level and, for all token ranges owned by the node, check whether you have the live endpoints needed to satisfy that consistency level.
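The "more elaborate solution" mentioned at the end of the comment above could look something like the sketch below. It is an assumption-heavy illustration only: ranges and endpoints are plain strings, the replica map is supplied by the caller, and only a simple quorum (replication factor / 2 + 1) is modelled. None of these names come from Cassandra or from the attached patches.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: are there enough live replicas for a quorum on every range?
final class QuorumReadinessSketch {

    /** Number of replicas needed for a quorum at the given replication factor. */
    static int quorumFor(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /**
     * Returns true when every token range has at least a quorum of live replicas.
     * replicasByRange maps a range label to the endpoints that replicate it.
     */
    static boolean canServeQuorum(Map<String, List<String>> replicasByRange, Set<String> liveEndpoints) {
        for (List<String> replicas : replicasByRange.values()) {
            long live = replicas.stream().filter(liveEndpoints::contains).count();
            if (live < quorumFor(replicas.size()))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ranges = Map.of(
            "(0,100]", List.of("a", "b", "c"),
            "(100,200]", List.of("b", "c", "d"));
        // Only a and b are marked live, so the second range has a single live replica.
        System.out.println(canServeQuorum(ranges, Set.of("a", "b")));
        System.out.println(canServeQuorum(ranges, Set.of("a", "b", "c")));
    }
}
```

A check of this shape would let native transport start while some nodes are down, as long as the ranges this node serves can still reach quorum.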
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767052#comment-17767052 ]

Cameron Zemek commented on CASSANDRA-18845:
---

With this removed
{code:java}
(epSize == liveSize || liveSize > 1)
{code}
the j11_dtests just passed. [j11_dtests (120384) - instaclustr/cassandra (circleci.com)|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3180/workflows/2f7e6199-d865-4eee-a3b1-9511a4c88a45/jobs/120384]
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767024#comment-17767024 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

I am retracting my note about loopback addresses. It does call waitToSettle because (empirically tested), in
{code}
if (!FBUtilities.getBroadcastAddressAndPort().equals(InetAddressAndPort.getLoopbackAddress()))
    Gossiper.waitToSettle();
{code}
the expression "FBUtilities.getBroadcastAddressAndPort().equals(InetAddressAndPort.getLoopbackAddress())" evaluates to false. That makes sense, as the broadcast address is 127.0.0.2 and the loopback address is 127.0.0.1. The negated condition is therefore true, so it does call waitToSettle.
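The point in the comment above can be checked with plain java.net.InetAddress. This is a stand-in sketch: it assumes that the equality check in the guard behaves like an address comparison, and it does not use Cassandra's InetAddressAndPort or FBUtilities. The class and helper names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-in for the guard around Gossiper.waitToSettle().
final class LoopbackGuardSketch {

    /** Mirrors the shape of the guard: wait unless broadcast equals the loopback address. */
    static boolean wouldWaitToSettle(InetAddress broadcast, InetAddress loopback) {
        return !broadcast.equals(loopback);
    }

    /** Builds an IPv4 address from its four octets without any DNS lookup. */
    static InetAddress ipv4(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // Only thrown for an illegal address length, which cannot happen here.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress loopback = ipv4(127, 0, 0, 1);
        InetAddress secondNode = ipv4(127, 0, 0, 2);
        // 127.0.0.2 is in the loopback range, yet it is not equal to 127.0.0.1.
        System.out.println(secondNode.isLoopbackAddress());
        System.out.println(wouldWaitToSettle(secondNode, loopback));
        System.out.println(wouldWaitToSettle(loopback, loopback));
    }
}
```

An equality check against a single loopback address only skips the wait for a node broadcasting exactly that address, whereas `isLoopbackAddress()` is true for the whole 127.0.0.0/8 range.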
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767007#comment-17767007 ]

Cameron Zemek commented on CASSANDRA-18845:
---

[^stream.log]

Without this patch I get nodes stuck, unable to join a large test cluster:
{noformat}
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: INFO o.a.cassandra.service.StorageService JOINING: Starting to bootstrap...
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: Exception (java.lang.RuntimeException) encountered during startup: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: java.lang.RuntimeException: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294)
{noformat}
The node is in an endless restart cycle (since our service keeps retrying), reporting a different IP each time.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767125#comment-17767125 ]

Brandon Williams commented on CASSANDRA-18845:
---

I don't have time to look at this fully but one thing you may want to do is something I did on CASSANDRA-18792 to find the issue, which is add more debugging around the echoes and push it up to debug so I didn't have to cloud everything with TRACE. https://github.com/driftx/cassandra/commit/e1e6b1a0fb0dacc067ddc5910659e1fe6da2cd52
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767037#comment-17767037 ]

Cameron Zemek commented on CASSANDRA-18845:
---

The
{noformat}
(epSize == liveSize || liveSize > 1)
{noformat}
part breaks dtests. For example,
{noformat}
pytest --force-resource-intensive-tests --cassandra-dir=/home/grom/dev/cassandra materialized_views_test.py::TestMaterializedViews::test_throttled_partition_update
{noformat}
This test fails because it shuts down a 5 node cluster and then starts and stops each node one at a time, so liveSize > 1 is never true.

Possible paths forward:
# Make the check that waits for other nodes off by default, requiring a system property to enable it.
# Figure out why there is this large delay between the waitToSettle call and getting ECHO responses.
# Have the tests override cassandra.skip_wait_for_gossip_to_settle
# ?? Some other option I haven't thought of yet.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766949#comment-17766949 ]

Cameron Zemek commented on CASSANDRA-18845:
---

Still running, but sharing the results so far:
{noformat}
$ pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
/home/grom/dtest/lib/python3.10/site-packages/ccmlib/common.py:773: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  return LooseVersion(match.group(1))
== test session starts ===
platform linux -- Python 3.10.12, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/grom/tmp/cassandra-dtest
configfile: pytest.ini
plugins: repeat-0.9.1, flaky-3.7.0, timeout-1.4.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 500 items

transient_replication_ring_test.py ... [ 11%]
{noformat}
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766917#comment-17766917 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

I just noticed this in your comment above:

_This is going to be very difficult todo. dtests setup clusters on loopback addresses and waitToSettle code path has a guard against it if using a loopback address. Also, the problems mostly become apparent with large clusters._

This is indeed true (1): Gossiper.waitToSettle is called only when the node is not on a loopback address. Since our dtests all run on loopback (right?), I do not think that code was ever invoked during dtests, so its revert was not necessary.

_If I redo the patch and remove the changes to ECHO and show those tests do not have regression would this allow the ticket to move forward?_

I think that is reasonable, wdyt, [~brandon.williams]?

I think it was unfortunate that we mixed the solution for the flood of echoes with the change to wait for at least one node to be up. I think the waiting for at least 1 node can go in, and we can focus on the echoes separately.

(1) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/CassandraDaemon.java#L400-L401
(2) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/BootStrapper.java#L213-L214
(3) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/BootStrapper.java#L235-L236
(4) https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/schema/MigrationCoordinator.java#L696-L697
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766850#comment-17766850 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

[~cam1982] you can simulate a lost echo even in a setup with 2 nodes. This is definitely possible with in-jvm dtests. You can drop all communication between nodes like this (1)

(1) https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/AuthTest.java#L99-L101
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766694#comment-17766694 ]

Cameron Zemek commented on CASSANDRA-18845:
---

[^delay.log]

Attached a log from a 105-node test cluster that shows the delay between starting to wait for gossip and getting UP replies back. Snippet:
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80
Sep 19 08:10:57 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper InetAddress /54.149.62.104 is now UP
{noformat}
So the delay is in sending out the Echo.
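For the delay.log snippet in the comment above, the gap can be read straight off the timestamps. The following is only a minimal sketch for doing that arithmetic; the class and method names are hypothetical, it assumes the syslog-style layout shown in the snippet (time of day in the third whitespace-separated field), and it ignores logs that cross midnight.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, not part of Cassandra: measures the time between two log lines.
final class GossipDelaySketch
{
    // Extracts the HH:mm:ss field from a line such as
    // "Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO ...".
    static LocalTime timeOf(String logLine)
    {
        String[] fields = logLine.trim().split("\\s+");
        return LocalTime.parse(fields[2]);
    }

    // Seconds elapsed between two lines from the same day.
    static long secondsBetween(String earlier, String later)
    {
        return Duration.between(timeOf(earlier), timeOf(later)).getSeconds();
    }

    public static void main(String[] args)
    {
        String waiting = "Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...";
        String echo = "Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80";
        String up = "Sep 19 08:10:57 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper InetAddress /54.149.62.104 is now UP";

        System.out.println("wait -> first echo sent: " + secondsBetween(waiting, echo) + "s");
        System.out.println("wait -> first UP:        " + secondsBetween(waiting, up) + "s");
    }
}
```

On the three lines quoted above this gives 71 seconds until the first Echo is sent and 72 seconds until the first `is now UP`, which matches the point that the time is spent before the Echo goes out rather than waiting for its reply.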
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766677#comment-17766677 ]

Cameron Zemek commented on CASSANDRA-18845:
---

!test1.log|width=7,height=7,align=absmiddle! !test2.log|width=7,height=7,align=absmiddle! !test3.log|width=7,height=7,align=absmiddle!

Tested the patch 3 times to confirm it is working.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766628#comment-17766628 ]

Cameron Zemek commented on CASSANDRA-18845:
---

[Cassandra 18845 3.11 by grom358 · Pull Request #2701 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2701]
[Cassandra 18845 4.0 by grom358 · Pull Request #2702 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2702]
[Cassandra 18845 4.1 by grom358 · Pull Request #2703 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2703]
[Cassandra 18845 5.0 by grom358 · Pull Request #2704 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2704]
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766192#comment-17766192 ]

Cameron Zemek commented on CASSANDRA-18845:
---

CASSANDRA-18543 had 3 components:
# Allow overriding the values used in waitToSettle.
# Make waitToSettle also consider the liveEndpoint members as part of settling.
# Change the handling of ECHO requests to remove duplicate in-flight ECHOs and duplicate 'is now UP' log messages about the same node going into the UP state.

With the revert in CASSANDRA-18854, did the changes to waitToSettle need to be reverted? The problem seems to be the changes to ECHO.

> The next step for this ticket to move forward will be to create tests that
> demonstrate the problem and guard against regressions.

This is going to be very difficult to do. dtests set up clusters on loopback addresses, and the waitToSettle code path has a guard against it when using a loopback address. Also, the problems mostly become apparent with large clusters.

If I redo the patch without the changes to ECHO and show that those tests do not regress, would this allow the ticket to move forward? I am also in the process of setting up a large test cluster.

[^example.log] shows an example of what happens without the patched waitToSettle: gossip settles before nodes have finished being marked as UP.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765577#comment-17765577 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

That is unfortunate. We should probably focus more on finding out what is causing these long initial delays and how to remediate them, rather than applying various band-aids (even though they are done in good faith).
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765558#comment-17765558 ]

Brandon Williams commented on CASSANDRA-18845:
---

CASSANDRA-18543 is going to be reverted for causing a regression. The next step for this ticket to move forward will be to create tests that demonstrate the problem and guard against regressions.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765429#comment-17765429 ]

Cameron Zemek commented on CASSANDRA-18845:
---

I need to do more investigation into the slowness. I suspect it's due to the flood of gossip messages on startup. The previous patch in CASSANDRA-18543 removed the duplicate ECHO messages to cut down on this. The behavior I notice in production, though, is that there is a large initial delay (> 10 seconds) before any nodes are marked as `is now UP`, and then the events flood in. On large clusters it takes over a minute to finish receiving them all.

Prior to CASSANDRA-18543, waitToSettle never checked liveSize at all and so would start up regardless of the UP status of nodes. With that change, it waits, assuming the polling starts while UP statuses are being received. So the problem now is waiting for that initial event.

The previous patch from CASSANDRA-18543 allowed overriding the gossip parameters, but in hindsight it's difficult to determine a suitable default for that initial wait, as it's not consistent. The algorithm in waitToSettle relies on seeing a change in these values, so if that initial delay is greater than the wait time plus the polling phase, it will move on and start NTR even though we have yet to see any nodes as UP.

You are correct that even with this proposed patch it's still possible to start NTR too early, e.g. if one node reports UP but the delay before the next event is longer than the polling period. I am not seeing that in production so far, though.

Therefore, the purpose of this patch is to have it wait for the first `is now UP` from a node instead of relying on cassandra.gossip_settle_min_wait_ms.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765035#comment-17765035 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

Interesting. I am curious what causes that initial delay. What you are saying is that it takes a long time for the nodes to be up, and then (from the log you posted) all of them appear to be reported at more or less the same time? There is an initial delay of dozens of seconds before they start to get reported?

If that is true, then it probably makes sense to have a condition like that, so we see at least some other nodes up before counting the round and increasing numOkay.

However, if we have this
{code:java}
if (currentSize == epSize && currentLive == liveSize && (epSize == liveSize || liveSize > 1))
{code}
then what if we have
{code}
currentSize = 2, epSize = 2, currentLive = 2, liveSize = 2
{code}
That "if" would return true, so numOkay would be increased and it would count as a valid round. However, and it is a little hard to formulate correctly, isn't it true that we are not guaranteeing that QUORUM would be satisfied here anyway? It could stay on all "twos" for all rounds, and we would say that gossip has settled while there are a bunch of other nodes still to be reported that just have not made it, because we were stuck on 2 for three rounds.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764934#comment-17764934 ]

Cameron Zemek commented on CASSANDRA-18845:
---

[~brandon.williams] [~smiklosovic] the existing conditions
{noformat}
currentSize == epSize && currentLive == liveSize
{noformat}
are what stop it from starting Native Transport too early if gossip is still being updated (for example, liveSize is changing). By default waitToSettle waits 5 seconds, then polls every 1 second, 3 times, checking whether either liveSize or epSize changes, and resets numOkay if either does.

The problem arises when, for example, it takes 79 seconds for that first change in liveSize. liveSize was constantly at 1, so it decides gossip is settled because there were no changes in epSize or liveSize.

The extra condition therefore is: don't consider gossip settled if there is only 1 live endpoint (the node itself), unless it's a single node cluster (epSize == liveSize).

> So when there is a cluster of 50 nodes, without this change, that "if" would
> return false (or it would not return true fast enough to increment numOkay to
> break from that while) as there would be new endpoints or live members
> detected each round.

To rephrase, the problem is that there are no new endpoint or live member changes. waitToSettle currently considers gossip settled with liveSize == 1.

> why it takes almost minute and a half

This is a good question, but in general it takes quite a while for gossip to complete on clusters with multiple datacenters and/or a large number of nodes. I think that is a different, much more complex JIRA. The purpose of the attached patch is that you don't need to guess which cassandra.gossip_settle_min_wait_ms to use. It waits for at least one node to report `is now UP` before incrementing numOkay and continuing with the rest of the waitToSettle logic.

!image-2023-09-14-11-16-23-020.png!
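The polling behavior described in the comment above can be illustrated with a small simulation. This is a hedged sketch only, not the Cassandra implementation: it replaces the real sleeps and Gossiper calls with a scripted array of (epSize, liveSize) readings, the class and method names are hypothetical, and the constant of 3 required successes is taken from the "3 times" in the comment.

```java
// Hypothetical simulation of the settle loop; not Cassandra code.
final class SettleLoopSketch
{
    static final int POLL_SUCCESSES_REQUIRED = 3; // assumption, from "3 times" in the comment

    /**
     * samples[0] is the reading taken before polling starts; samples[i] for i >= 1
     * is the {epSize, liveSize} reading at poll i. Returns the poll number at which
     * gossip is declared settled, or -1 if the scripted samples run out first.
     */
    static int pollsUntilSettled(int[][] samples, boolean requireOtherLiveNode)
    {
        int epSize = samples[0][0];
        int liveSize = samples[0][1];
        int numOkay = 0;
        for (int poll = 1; poll < samples.length; poll++)
        {
            int currentSize = samples[poll][0];
            int currentLive = samples[poll][1];
            boolean unchanged = currentSize == epSize && currentLive == liveSize;
            // Proposed extra guard: single-node cluster, or at least one other node is UP.
            boolean liveEnough = !requireOtherLiveNode || currentSize == currentLive || currentLive > 1;
            numOkay = (unchanged && liveEnough) ? numOkay + 1 : 0;
            epSize = currentSize;
            liveSize = currentLive;
            if (numOkay == POLL_SUCCESSES_REQUIRED)
                return poll;
        }
        return -1;
    }

    public static void main(String[] args)
    {
        // 50 known endpoints, but no UP events arrive during the polling window.
        int[][] stuck = { {50, 1}, {50, 1}, {50, 1}, {50, 1}, {50, 1} };
        // Same start, then the UP events flood in from poll 4 onward.
        int[][] flood = { {50, 1}, {50, 1}, {50, 1}, {50, 1}, {50, 20}, {50, 50}, {50, 50}, {50, 50}, {50, 50} };

        System.out.println("stuck, existing check: " + pollsUntilSettled(stuck, false));
        System.out.println("stuck, with guard:     " + pollsUntilSettled(stuck, true));
        System.out.println("flood, existing check: " + pollsUntilSettled(flood, false));
        System.out.println("flood, with guard:     " + pollsUntilSettled(flood, true));
    }
}
```

With these scripted readings the existing check settles at poll 3 in both cases, while liveSize is still 1. With the guard, the "stuck" case never settles within the window and the "flood" case settles at poll 8, once the live count has stopped changing.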
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764726#comment-17764726 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

Yeah, like ... if there are 20 nodes, RF is 5 and QUORUM is 3, then "liveSize > 1" means at least 2. But how do we know that these "2" satisfy _each query on local quorum_? Maybe there is a query whose quorum requires live nodes that have not been detected yet, or maybe I am missing something here.
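The arithmetic behind the concern in the comment above can be written down as a short sketch. The class and method names are hypothetical; the only fact relied on is that a quorum is a majority of the replication factor, i.e. RF / 2 + 1 with integer division. Which nodes are replicas for a given key is not modelled here.

```java
// Hypothetical helper for the quorum arithmetic; not Cassandra code.
final class QuorumSketch
{
    // Majority of the replicas: floor(rf / 2) + 1.
    static int quorum(int replicationFactor)
    {
        return replicationFactor / 2 + 1;
    }

    // True if the number of replicas seen as UP is enough for a quorum read or write.
    static boolean canSatisfyQuorum(int replicationFactor, int liveReplicas)
    {
        return liveReplicas >= quorum(replicationFactor);
    }

    public static void main(String[] args)
    {
        int rf = 5;
        System.out.println("quorum for RF=5: " + quorum(rf));
        // Best case for "liveSize > 1": both live nodes happen to be replicas for the key.
        System.out.println("2 live replicas enough? " + canSatisfyQuorum(rf, 2));
        System.out.println("3 live replicas enough? " + canSatisfyQuorum(rf, 3));
    }
}
```

Even in the best case, where the two live nodes are both replicas for the queried key, 2 is below the quorum of 3 for RF = 5, so "liveSize > 1" on its own says nothing about whether a given LOCAL_QUORUM query can succeed.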
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764724#comment-17764724 ]

Brandon Williams commented on CASSANDRA-18845:
---

bq. that we consider the gossip to be settled as soon as there is more than 1 live endpoint

That would seem to cause:

bq. do not want to start Native Transport until gossip settles otherwise queries can fail consistency such as LOCAL_QUORUM
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764707#comment-17764707 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

As I looked at this more closely, I realized I do not understand it either. If we want to add this:
{code:java}
if (currentSize == epSize && currentLive == liveSize && (epSize == liveSize || liveSize > 1))
{code}
when it was like this:
{code:java}
if (currentSize == epSize && currentLive == liveSize)
{code}
then, if I generalize, that basically means that we consider the gossip to be settled as soon as more than 1 live endpoint is detected?

So when there is a cluster of 50 nodes, without this change, that "if" would return false (or it would not return true fast enough to increment numOkay to break from that while) as there would be new endpoints or live members detected each round.

But the question is, as Brandon mentioned, why does it take almost a minute and a half?
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764688#comment-17764688 ]

Brandon Williams commented on CASSANDRA-18845:
---

I'm not sure I understand what the problem is.

bq. On a node just observed a 79 second gap between waiting for gossip and the first echo response to indicate a node is UP.

It seems like the reason for this is not in the code. What made the echo response take so long?
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764515#comment-17764515 ]

Stefan Miklosovic commented on CASSANDRA-18845:
---

I told Cameron privately about the strong preference for an in-jvm dtest to verify and test this behavior. Looking at the test steps described in his comment above, it should be rather straightforward to come up with one.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764467#comment-17764467 ]

Cameron Zemek commented on CASSANDRA-18845:
-------------------------------------------

I have attached a patch. Tested this as follows:
# Spin up a single node cluster. This works because of the epSize == liveSize check, which lets it bypass the liveSize > 1 check.
# Spin up a 3 node cluster. All 3 nodes start up NTR as expected.
# Shut down all nodes. Start up the first node; it stays waiting in gossip due to the liveSize > 1 requirement.
# Start up the second node. Now both nodes start NTR, since liveSize > 1 and there are no other incoming `is now UP` events, so gossip looks settled.

> Waiting for gossip to settle on live endpoints
> ----------------------------------------------
>
>                 Key: CASSANDRA-18845
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: 18845-3.11.patch
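
For readers following the four test steps in Cameron's comment, here is a rough sketch of how the settle condition could evaluate in each scenario. It is not the attached patch. The variable names epSize, liveSize, currentSize and currentLive come from the snippet in the issue description; the class name, the method looksSettled, and the way the epSize == liveSize bypass is combined with the liveSize > 1 check are assumptions reconstructed from the comment. The sketch also assumes both counts include the local node.

```java
public final class GossipSettleSketch {
    private GossipSettleSketch() {}

    /**
     * Hypothetical predicate, not taken from the patch.
     *
     * epSize / liveSize:       endpoint and live counts from the previous poll
     * currentSize / currentLive: the same counts from the current poll
     */
    static boolean looksSettled(int epSize, int liveSize, int currentSize, int currentLive) {
        // Counts must be unchanged between two consecutive polls.
        boolean stable = currentSize == epSize && currentLive == liveSize;
        // Either another node has been seen UP, or every known endpoint is
        // already live (which covers the single node cluster case).
        boolean enoughLive = liveSize > 1 || epSize == liveSize;
        return stable && enoughLive;
    }

    public static void main(String[] args) {
        // Step 1: single node cluster, one endpoint known, one live.
        System.out.println("single node:        " + looksSettled(1, 1, 1, 1));
        // Step 2: three node cluster, all three live.
        System.out.println("3 nodes, all up:    " + looksSettled(3, 3, 3, 3));
        // Step 3: restart, three endpoints known, only the local node live.
        System.out.println("3 known, 1 live:    " + looksSettled(3, 1, 3, 1));
        // Step 4: second node has come up.
        System.out.println("3 known, 2 live:    " + looksSettled(3, 2, 3, 2));
    }
}
```

Under these assumptions, only the step 3 case reports not settled, which matches the behaviour described in the comment. In the real wait loop a positive result would only increment numOkay; the number of consecutive okay polls required is not modelled here.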