[
https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784212#comment-17784212
]
Paulo Motta commented on CASSANDRA-18968:
-----------------------------------------
It seems like this issue was raised by [~aleksey] on CASSANDRA-13993:
{quote}As implemented currently, we are going to send PINGs potentially to
3.11/3.0 - unless we switch to gating by version, which we do sometimes.
{quote}
{quote}So I was thinking about a major upgrade bounce scenario. Think the first
ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to
every node, but receive no pongs, correct? So every node until a threshold will
have a significantly longer bounce. Do we care about this case?
{quote}
To which [~jasobrown] replied:
{quote}So here's the rub: we don't necessarily know the peer's version yet. The
ping messages are sent on the large/small connections, but we're not guaranteed
that at least one round of gossip has completed wherein we would learn the
version of the peers (we're still at in the startup process).
{quote}
However, I don't think this is a problem, since we [wait for gossip to
settle|https://github.com/apache/cassandra/blob/7b891db36d4bcfa116ee04e3f4b3f31af798d5b2/src/java/org/apache/cassandra/service/CassandraDaemon.java#L401]
before executing this check. Can you confirm this, [~brandon.williams]?
The worst that can happen if a peer's version is unknown is that we
unnecessarily execute this check, which will just fall back to the current
behavior. That is not a big deal IMO: it will just make startup slightly
slower and log a warning.
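To make the intended behavior concrete, here is a minimal sketch of the gating logic as I understand it from this discussion. It is not the actual patch: the class name, the enum, the map of peer versions and the MIN_MAJOR_WITH_PING constant are all hypothetical, and it assumes the peers' major versions (when known) have already been learned from gossip.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the version gate; not the code in the patch. */
final class ConnectivityCheckGate {
    // Assumption: peers below this major version do not answer PING messages.
    static final int MIN_MAJOR_WITH_PING = 4;

    enum Decision { RUN_CHECK, SKIP_CHECK }

    /**
     * Decides whether to run the startup connectivity check.
     *
     * @param peerMajorVersions peer address mapped to its major version,
     *                          or Optional.empty() if gossip has not told us yet
     */
    static Decision decide(Map<String, Optional<Integer>> peerMajorVersions) {
        for (Optional<Integer> version : peerMajorVersions.values()) {
            // A peer known to be on 3.X or older cannot reply with a PONG,
            // so waiting for it would only end in a timeout.
            if (version.isPresent() && version.get() < MIN_MAJOR_WITH_PING) {
                return Decision.SKIP_CHECK;
            }
        }
        // Unknown versions do not trigger a skip: the check runs and, at worst,
        // times out and logs a warning, which is the current behavior.
        return Decision.RUN_CHECK;
    }
}
```

With this shape, a cluster with one 3.11 peer and one 4.1 peer skips the check, while a peer with an unknown version only costs us the existing timeout.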
Confirmed that when upgrading a cluster from 3.11 to 4.1, the following message
is printed in debug.log on all nodes except the last one:
{noformat}
DEBUG [main] 2023-11-08 21:08:42,056 StartupClusterConnectivityChecker.java:97
- Skipping startup connectivity check as some nodes may be running Cassandra
version 3 or older which does not support connectivity checking.
{noformat}
On the last node to be upgraded, the check is executed as expected:
{noformat}
INFO [main] 2023-11-08 21:35:30,387 StartupClusterConnectivityChecker.java:128
- Blocking coordination until only a single peer is DOWN in the local
datacenter, timeout=10s
INFO [main] 2023-11-08 21:35:30,453 StartupClusterConnectivityChecker.java:181
- Ensured sufficient healthy connections with [DC1] after 63 milliseconds
{noformat}
{quote}+1, I am waiting for another formal +1 and we can ship this!
{quote}
LGTM. Would you like to commit this, [~smiklosovic]? If so, perhaps update
CHANGES.txt and the commit message to "Skip connectivity check when upgrading
from 3.X".
> StartupClusterConnectivityChecker fails on upgrade from 3.X
> -----------------------------------------------------------
>
> Key: CASSANDRA-18968
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18968
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Startup and Shutdown
> Reporter: Paulo Motta
> Assignee: Isaac Reath
> Priority: Normal
> Labels: lhf
> Fix For: 4.0.x, 4.1.x
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Starting up a new 4.X node on a 3.x cluster throws the following warning:
> {noformat}
> WARN [main] 2023-10-27 15:58:22,234
> StartupClusterConnectivityChecker.java:183 - Timed out after 10002
> milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W,
> A.B.C.D]}
> {noformat}
> I think this is because the PING messages used by the startup check are not
> available on 3.X.
> To provide a smoother upgrade experience we should probably disable this
> check on mixed-version clusters, or skip peers on versions < 4.x when doing
> the connectivity check.