[ https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794439#comment-17794439 ]
Brandon Williams commented on CASSANDRA-19178:
----------------------------------------------

Yes, you can make the seed provider reload the seeds with 'nodetool reloadseeds'
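A minimal sketch of how that suggestion could be applied in the Docker Swarm setup from the quoted report below; the container names, the {{docker exec}} wrapper, and the availability of {{reloadseeds}} in the nodetool shipped with the image are assumptions for illustration, not details taken from this ticket:

{code:java}
# After a seed service (e.g. cassandra7) has been restarted and its
# tasks.cassandra7 name resolves to a new task IP, ask the other nodes to
# re-read their seed provider instead of restarting them:
docker exec <cassandra8-container> nodetool reloadseeds
docker exec <cassandra9-container> nodetool reloadseeds

# reloadseeds should print the seed list that was loaded, which ought to
# reflect the current IPs behind tasks.cassandra7 / tasks.cassandra9.
# Then confirm ring membership from any node:
docker exec <cassandra8-container> nodetool status
{code}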
> Cluster upgrade 3.x -> 4.x fails due to IP change
> -------------------------------------------------
>
>                 Key: CASSANDRA-19178
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Aldo
>            Priority: Normal
>         Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different servers. The 3 services run version 3.11.16, using the official Cassandra 3.11.16 image from Docker Hub. The first service is configured just with the following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, are applied to the {_}cassandra.yaml{_}. So for instance the _cassandra.yaml_ for the first service contains the following (and the rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
> - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> The other services (8 and 9) have a similar configuration, obviously with a different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and {{tasks.cassandra9}}).
> The cluster runs smoothly and all the nodes are perfectly able to rejoin the cluster whatever event occurs, thanks to the Docker Swarm {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for Docker Swarm to restart it, force-update it in order to force a restart, scale the service to 0 and back to 1, restart an entire server, or turn off and then turn on all 3 servers. I have never found an issue with this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 3.11.16 (simply by upgrading the official Docker image associated with the services) without issues. I was also able, thanks to a 2.2.8 snapshot on each server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables now have the {{me-*}} prefix.
>
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The procedure I follow is very simple:
> # I start from the _cassandra7_ service (which is a seed node)
> # {{nodetool drain}}
> # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
> # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same one I followed for the 2.2.8 --> 3.11.16 upgrade, obviously with a different version at step 4. Unfortunately the 3.x --> 4.x upgrade is not working: the _cassandra7_ service restarts and attempts to communicate with the other seed node ({_}cassandra9{_}), but the log of _cassandra7_ shows the following:
> {code:java}
> INFO [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step #4, so only _cassandra7_ is saying something in the logs.
> I tried multiple versions (4.0.11 but also 4.0.0) but the outcome is always the same. Of course, when I follow steps 1..3, then restore the 3.x snapshot and finally perform step #4 using the official 3.11.16 image, node 7 restarts correctly and joins the cluster. I attached the relevant part of that log (see {_}cassandra7.downgrade.log{_}), where you can see that nodes 7 and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) supporting both encrypted and unencrypted traffic. As stated previously, I'm using the untouched official Cassandra images, so my cluster, inside the Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 steps above for the _cassandra9_ and _cassandra8_ services as well, the cluster works in the end. But this is not acceptable, because the cluster is unavailable until I finish the full upgrade of all nodes: I need to perform a rolling update, one node after the other, where only 1 node is temporarily down and the other N-1 stay up.
> Any idea on how to further investigate the issue? Thanks
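On the port-7000 suspicion in the report above, the relevant knobs in a stock 4.x {{cassandra.yaml}} look roughly like the sketch below. The key names and the default behaviour are recalled from the shipped file and should be double-checked against the image actually in use; nothing here comes from the ticket itself.

{code:java}
server_encryption_options:
  # 'none' keeps all internode traffic unencrypted, matching the 3.11 setup
  internode_encryption: none
  # when optional is true (reportedly the effective default while
  # internode_encryption is none), storage_port 7000 accepts both encrypted
  # and unencrypted connections, so an unencrypted 3.11 peer should still be
  # able to connect to an upgraded 4.x node during the rolling upgrade
  # optional: true
{code}

If those defaults are in effect, a mixed 3.11/4.x cluster should keep talking plaintext on port 7000, which would point the "Connection reset by peer" more towards the IP change mentioned in the issue title (and the {{nodetool reloadseeds}} suggestion above) than towards an encryption mismatch.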