[ https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793944#comment-17793944 ]
Aldo commented on CASSANDRA-19178:
----------------------------------

I apologize in advance if reopening is not the correct behavior; please tell me if I should open a new issue instead. I think I have found the root cause of the issue, and I wonder whether it is a bug or whether it is caused by a misconfiguration on my side.

Using {{nodetool setlogginglevel org.apache.cassandra TRACE}} on both the upgraded 4.x node (cassandra7) and the running 3.x seed node (cassandra9), I was able to isolate the relevant log entries.

On cassandra7:
{code:java}
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 EndpointMessagingVersions.java:67 - Assuming current protocol version for tasks.cassandra9/10.0.2.92:7000
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 OutboundConnectionInitiator.java:131 - creating outbound bootstrap to peer: (tasks.cassandra9/10.0.2.92:7000, tasks.cassandra9/10.0.2.92:7000), framing: CRC, encryption: unencrypted, requestVersion: 12
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,411 OutboundConnectionInitiator.java:236 - starting handshake with peer tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000), msg = Initiate(request: 12, min: 10, max: 12, type: URGENT_MESSAGES, framing: true, from: tasks.cassandra7/10.0.2.137:7000)
INFO [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,412 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000)
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
{code}
On cassandra9:
{code:java}
TRACE [ACCEPT-tasks.cassandra9/10.0.2.92] 2023-12-06 22:16:56,411 MessagingService.java:1315 - Connection version 12 from /10.0.2.137
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 IncomingTcpConnection.java:111 - IOException reading from socket; closing
java.io.IOException: Peer-used messaging version 12 is larger than max supported 11
	at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:153)
	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:98)
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 IncomingTcpConnection.java:125 - Closing socket Socket[addr=/10.0.2.137,port=45680,localport=7000] - isclosed: false
{code}
So it seems there is a mismatch on this _messaging version_ between the two nodes.
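Putting the two traces together, my reading (which may well be wrong) is: the upgraded 4.1 node has no stored messaging version for cassandra9, so it "assumes the current protocol version" (12) when opening the outbound connection, while the 3.11 node refuses anything above its own maximum (11) and closes the socket; the 4.1 node then only sees the resulting "Connection reset by peer". Below is a minimal sketch of that interaction, just to illustrate the behavior I observe; the class and method names are mine, not the actual Cassandra sources, and only the version numbers come from the logs above.
{code:java}
import java.io.IOException;

// Hypothetical illustration of the handshake failure seen in the traces above.
// The constants come from the logs (min: 10, max: 12 on the 4.x side, "max supported 11" on the 3.x side);
// everything else is a simplified sketch, not the real Cassandra implementation.
public class MessagingVersionMismatchSketch
{
    static final int V3_MAX_SUPPORTED = 11;  // reported by cassandra9 (3.11.16)
    static final int V4_CURRENT       = 12;  // requestVersion used by cassandra7 (4.1.3)

    // 4.x side: with no known version for the peer, it falls back to its own current version
    // ("Assuming current protocol version for tasks.cassandra9...").
    static int requestVersionFor(Integer knownPeerVersion)
    {
        return knownPeerVersion == null ? V4_CURRENT : Math.min(knownPeerVersion, V4_CURRENT);
    }

    // 3.x side: rejects a connection that advertises a version above its maximum and closes the socket.
    static void acceptConnection(int peerVersion) throws IOException
    {
        if (peerVersion > V3_MAX_SUPPORTED)
            throw new IOException("Peer-used messaging version " + peerVersion
                                  + " is larger than max supported " + V3_MAX_SUPPORTED);
    }

    public static void main(String[] args) throws IOException
    {
        int requested = requestVersionFor(null); // cassandra7 has no stored version for cassandra9
        acceptConnection(requested);             // cassandra9 throws; cassandra7 sees "Connection reset by peer"
    }
}
{code}
(In the Initiate message above the 4.1 node does advertise min: 10, max: 12, so I would have expected some form of fallback to a version the 3.11 node accepts, but I may be misreading the handshake.)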
> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-19178
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Aldo
>            Priority: Normal
>         Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named _cassandra7_, _cassandra8_, _cassandra9_) running on 3 different servers. The 3 services run version 3.11.16, using the official Cassandra 3.11.16 image on Docker Hub. The first service is configured just with the following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modify the _cassandra.yaml_. So for instance the _cassandra.yaml_ for the first service contains the following (and the rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
> - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> The other services (8 and 9) have a similar configuration, obviously with a different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and {{tasks.cassandra9}} respectively).
> The cluster is running smoothly and all the nodes are perfectly able to rejoin the cluster whatever event occurs, thanks to the Docker Swarm {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for Docker Swarm to restart it, force-update it in order to force a restart, scale the service to 0 and back to 1, restart an entire server, or turn all 3 servers off and on again. I have never found an issue with this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 3.11.16 (simply upgrading the official Docker image associated with the services) without issues. Thanks to a 2.2.8 snapshot on each server, I was also able to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables now have the {{me-*}} prefix.
>
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The procedure that I follow is very simple:
> # I start from the _cassandra7_ service (which is a seed node)
> # {{nodetool drain}}
> # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
> # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same one I followed for the 2.2.8 --> 3.11.16 upgrade, obviously with a different version at step 4. Unfortunately the 3.x --> 4.x upgrade does not work: the _cassandra7_ service restarts and attempts to communicate with the other seed node (_cassandra9_), but the log of _cassandra7_ shows the following:
> {code:java}
> INFO [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step 4, so only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is always the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot and finally perform step 4 using the official 3.11.16 image, node 7 restarts correctly and joins the cluster. I attached the relevant part of the log (see _cassandra7.downgrade.log_) where you can see that nodes 7 and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) supporting both encrypted and unencrypted traffic. As stated previously, I'm using the untouched official Cassandra images, so my whole cluster, inside the Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 steps above for the _cassandra9_ and _cassandra8_ services as well, in the end the cluster works.
> But this is not acceptable, because the cluster stays unavailable until I finish the full upgrade of all the nodes: I need to perform a rolling, node-by-node upgrade, where only 1 node is temporarily down and the other N-1 stay up.
> Any idea on how to further investigate the issue? Thanks
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)