[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784759#comment-17784759 ] Stefan Miklosovic commented on CASSANDRA-18968: --- I evaluated the failures as flakiness. These tests are passing locally. We are good to merge. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784400#comment-17784400 ] Brandon Williams commented on CASSANDRA-18968: -- Yeah, those failures looked possibly environmental to me, let's doublecheck. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784286#comment-17784286 ] Stefan Miklosovic commented on CASSANDRA-18968: --- Well, I think we have a problem: https://app.circleci.com/pipelines/github/driftx/cassandra/1363/workflows/5fafb76b-683f-43a6-9803-5e67161bac4c/jobs/61070 > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784225#comment-17784225 ] Paulo Motta commented on CASSANDRA-18968: - bq. Whole "waiting for gossip to settle" machinery is ... not ideal. Yes, it works in most of the situations but there are edge cases when it does not, e.g. when there are large clusters, it may happen that it may evaluate that gossip is "settled" falsely because it took so much time to detect any changes that it was thinking it is settled. I'm aware waitToSettle is not reliable. Nevertheless I think having a "best-effort" skipping of this check when 3.X nodes are detected in gossip is valuable. This will mostly work as long as gossip with a single node was successful, since it will get the latest known versions of the other nodes. In the case where the gossip information is absent and there are 3.X nodes present in the cluster, it's not a big deal - the check will just be executed and the timeout warning above will be unnecessarily emitted. We just don't want to skip this check when *all nodes are upgraded to 4.x* but I don't think this would happen if waitToSettle fails. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784224#comment-17784224 ] Brandon Williams commented on CASSANDRA-18968: -- I think it would make a lot of sense to run the upgrade tests here. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784219#comment-17784219 ] Stefan Miklosovic commented on CASSANDRA-18968: --- [~paulo] well ... the waiting for a gossip as you shown that method has its own set of problems (1) my colleague recently hit and tried to solve it but we had to revert that, unfortunately. It is quite a read though (2), (2) is the followup ticket which was trying to improve (1) but we never got there because we reverted it. Whole "waiting for gossip to settle" machinery is ... not ideal. Yes, it works in most of the situations but there are edge cases when it does not, e.g. when there are large clusters, it may happen that it may evaluate that gossip is "settled" falsely because it took so much time to detect any changes that it was thinking it is settled. (1) https://issues.apache.org/jira/browse/CASSANDRA-18543 (2) https://issues.apache.org/jira/browse/CASSANDRA-18845 > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784212#comment-17784212 ] Paulo Motta commented on CASSANDRA-18968: - It seems like this issue was raised by [~aleksey] on CASSANDRA-13993: {quote}As implemented currently, we are going to send PINGs potentially to 3.11/3.0 - unless we switch to gating by version, which we do sometimes. {quote} {quote}So I was thinking about a major upgrade bounce scenario. Think the first ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to every node, but receive no pongs, correct? So every node until a threshold will have a significantly longer bounce. Do we care about this case? {quote} Which was replied by [~jasobrown] with: {quote}So here's the rub: we don't necessarily know the peer's version yet. The ping messages are sent on the large/small connections, but we're not guaranteed that at least one round of gossip has completed wherein we would learn the version of the peers (we're still at in the startup process). {quote} However I don't think this is a problem since we [wait for gossip to settle|https://github.com/apache/cassandra/blob/7b891db36d4bcfa116ee04e3f4b3f31af798d5b2/src/java/org/apache/cassandra/service/CassandraDaemon.java#L401] before executing this check? Can you confirm this [~brandon.williams] ? The worst that can happen if the version of a peer is unknown is to unnecessarily execute this check which will just fallback to the current behavior which is not a big deal IMO - it will just make startup slightly slower and log a warning. Confirmed that when upgrading a cluster from 3.11 to 4.1 the following message is print on the debug.log for all except the last node: {noformat} DEBUG [main] 2023-11-08 21:08:42,056 StartupClusterConnectivityChecker.java:97 - Skipping startup connectivity check as some nodes may be running Cassandra version 3 or older which does not support connectivity checking. {noformat} In the last node to be upgraded the check is executed as expected: {noformat} INFO [main] 2023-11-08 21:35:30,387 StartupClusterConnectivityChecker.java:128 - Blocking coordination until only a single peer is DOWN in the local datacenter, timeout=10s INFO [main] 2023-11-08 21:35:30,453 StartupClusterConnectivityChecker.java:181 - Ensured sufficient healthy connections with [DC1] after 63 milliseconds {noformat} {quote}+1, I am waiting for another formal +1 and we can ship this! {quote} LGTM, would you like to commit this [~smiklosovic] ? If so perhaps update the CHANGES.txt and commit message to "Skip connectivity check when upgrading from 3.X". > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783911#comment-17783911 ] Stefan Miklosovic commented on CASSANDRA-18968: --- [4.0 j11-precommit|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3457/workflows/10731e97-21b1-4846-9450-800f8dc0bc05] [4.0 j8-precommit|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3457/workflows/b37d8e69-cc51-48ea-9d12-75e493eb1351] [4.1 j11-precommit|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3458/workflows/122fcd49-3e1e-494b-8949-b75bf2b9f6af] [4.1 j8-precommit|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3458/workflows/5539e1f8-ec73-4201-9ef4-6c12f9858aa9] +1, I am waiting for another format +1 and we can ship this! these were the squashed branches I was running CI for: [4.0|https://github.com/instaclustr/cassandra/commits/CASSANDRA-18968-4.0] [4.1|https://github.com/instaclustr/cassandra/commits/CASSANDRA-18968-4.1] > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783834#comment-17783834 ] Isaac Reath commented on CASSANDRA-18968: - Yeah just 4.0 and 4.1. Afaik 5.0 isn't supposed to support upgrading from 3.x I think we're ok. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Fix For: 4.0.x, 4.1.x > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783684#comment-17783684 ] Stefan Miklosovic commented on CASSANDRA-18968: --- This goes just to 4.0 and 4.1, correct? [~isaacreath] [~paulo] > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Time Spent: 1h 50m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782742#comment-17782742 ] Isaac Reath commented on CASSANDRA-18968: - Updated for 4.0 (https://github.com/apache/cassandra/pull/2863) and 4.1 (https://github.com/apache/cassandra/pull/2862) > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Assignee: Isaac Reath >Priority: Normal > Labels: lhf > Time Spent: 10m > Remaining Estimate: 0h > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780532#comment-17780532 ] Brandon Williams commented on CASSANDRA-18968: -- bq. this does not fail startup, it just unnecessarily emits a misleading warning. It's pretty harmless. Ah, I see, that explains why tests pass. bq. We could add a verbhandler to 3.x but at this stage I think it's easier/simpler to just version gate this feature I agree. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Priority: Normal > Labels: lhf > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780515#comment-17780515 ] Paulo Motta commented on CASSANDRA-18968: - It looks like support to PING was added to 3.x on CASSANDRA-14447 but [no verb handler was implemented|https://github.com/jasobrown/cassandra/commit/795de31194f1109490d759d8b339efcf13118971] which explains why there are no acknowledgements. We could add a verbhandler to 3.x but at this stage I think it's easier/simpler to just version gate this feature to fully upgraded 4.x clusters. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Priority: Normal > Labels: lhf > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780509#comment-17780509 ] Paulo Motta commented on CASSANDRA-18968: - Hi [~brandon.williams] - this does not fail startup, it just unnecessarily emits a misleading warning. It's pretty harmless. I'm not familiar with this patch but I suspect 3.x nodes do not have the {{PING_REQ}} message [sent here|https://github.com/apache/cassandra/blob/f8c240147c307bf5c527ff3a34e3c0f3043b7e9c/src/java/org/apache/cassandra/net/StartupClusterConnectivityChecker.java#L143] so the connectivity checker does not get the necessary number of acknowledgements (3) to succeed. If this is the case, I think the easiest fix is just to disable the check if there are 3.X nodes, so the warning is not unnecessarily emitted. > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Priority: Normal > Labels: lhf > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X
[ https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780500#comment-17780500 ] Brandon Williams commented on CASSANDRA-18968: -- Hmm, are you sure about this? The upgrade tests have some environmental flakiness, but generally have been passing: https://ci-cassandra.apache.org/job/Cassandra-4.0/lastSuccessfulBuild/testReport/dtest-upgrade.upgrade_tests.upgrade_through_versions_test/ > StartupClusterConnectivityChecker fails on upgrade from 3.X > --- > > Key: CASSANDRA-18968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18968 > Project: Cassandra > Issue Type: Bug > Components: Local/Startup and Shutdown >Reporter: Paulo Motta >Priority: Normal > Labels: lhf > > Starting up a new 4.X node on a 3.x cluster throws the following warning: > {noformat} > WARN [main] 2023-10-27 15:58:22,234 > StartupClusterConnectivityChecker.java:183 - Timed out after 10002 > milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, > A.B.C.D]} > {noformat} > I think this is because the PING messages used by the startup check are not > available on 3.X. > To provide a smoother upgrade experience we should probably disable this > check on a mixed version clusters, or skip peers on versions < 4.x when doing > the connectivity check. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org