[ https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793944#comment-17793944 ]
Aldo commented on CASSANDRA-19178:
----------------------------------

I apologize in advance if reopening is not the correct behavior; please tell me if I should open a new issue instead. I think I have found the root cause of the issue, and I wonder whether it is a bug or whether it is caused by a misconfiguration on my side.

Using {{nodetool setlogginglevel org.apache.cassandra TRACE}} on both the upgraded 4.x node (cassandra7) and the running 3.x seed node (cassandra9), I was able to isolate the relevant log entries.

On cassandra7:
{code:java}
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 EndpointMessagingVersions.java:67 - Assuming current protocol version for tasks.cassandra9/10.0.2.92:7000
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 OutboundConnectionInitiator.java:131 - creating outbound bootstrap to peer: (tasks.cassandra9/10.0.2.92:7000, tasks.cassandra9/10.0.2.92:7000), framing: CRC, encryption: unencrypted, requestVersion: 12
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,411 OutboundConnectionInitiator.java:236 - starting handshake with peer tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000), msg = Initiate(request: 12, min: 10, max: 12, type: URGENT_MESSAGES, framing: true, from: tasks.cassandra7/10.0.2.137:7000)
INFO [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,412 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000)
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
{code}
On cassandra9:
{code:java}
TRACE [ACCEPT-tasks.cassandra9/10.0.2.92] 2023-12-06 22:16:56,411 MessagingService.java:1315 - Connection version 12 from /10.0.2.137
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 IncomingTcpConnection.java:111 - IOException reading from socket; closing
java.io.IOException: Peer-used messaging version 12 is larger than max supported 11
	at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:153)
	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:98)
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 IncomingTcpConnection.java:125 - Closing socket Socket[addr=/10.0.2.137,port=45680,localport=7000] - isclosed: false
{code}
So it seems there is a mismatch on this _messaging version_ between the two nodes.
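Putting the two traces together, my reading (which may well be wrong) is: the upgraded 4.1 node has no stored messaging version for cassandra9, so it "assumes the current protocol version" (12) when opening the outbound connection, while the 3.11 node refuses anything above its own maximum (11) and closes the socket; the 4.1 node then only sees the resulting "Connection reset by peer". Below is a minimal sketch of that interaction, just to illustrate the behavior I observe; the class and method names are mine, not the actual Cassandra sources, and only the version numbers come from the logs above.
{code:java}
import java.io.IOException;

// Hypothetical illustration of the handshake failure seen in the traces above.
// The constants come from the logs (min: 10, max: 12 on the 4.x side, "max supported 11" on the 3.x side);
// everything else is a simplified sketch, not the real Cassandra implementation.
public class MessagingVersionMismatchSketch
{
    static final int V3_MAX_SUPPORTED = 11;  // reported by cassandra9 (3.11.16)
    static final int V4_CURRENT       = 12;  // requestVersion used by cassandra7 (4.1.3)

    // 4.x side: with no known version for the peer, it falls back to its own current version
    // ("Assuming current protocol version for tasks.cassandra9...").
    static int requestVersionFor(Integer knownPeerVersion)
    {
        return knownPeerVersion == null ? V4_CURRENT : Math.min(knownPeerVersion, V4_CURRENT);
    }

    // 3.x side: rejects a connection that advertises a version above its maximum and closes the socket.
    static void acceptConnection(int peerVersion) throws IOException
    {
        if (peerVersion > V3_MAX_SUPPORTED)
            throw new IOException("Peer-used messaging version " + peerVersion
                                  + " is larger than max supported " + V3_MAX_SUPPORTED);
    }

    public static void main(String[] args) throws IOException
    {
        int requested = requestVersionFor(null); // cassandra7 has no stored version for cassandra9
        acceptConnection(requested);             // cassandra9 throws; cassandra7 sees "Connection reset by peer"
    }
}
{code}
(In the Initiate message above the 4.1 node does advertise min: 10, max: 12, so I would have expected some form of fallback to a version the 3.11 node accepts, but I may be misreading the handshake.)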
> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-19178
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Aldo
>            Priority: Normal
>         Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named _cassandra7_, _cassandra8_, _cassandra9_) running on 3 different servers. The 3 services run version 3.11.16, using the official Cassandra 3.11.16 image on Docker Hub. The first service is configured just with the following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modify the _cassandra.yaml_. So for instance the _cassandra.yaml_ for the first service contains the following (and the rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
> - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> The other services (8 and 9) have a similar configuration, obviously with a different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and {{tasks.cassandra9}} respectively).
> The cluster is running smoothly and all the nodes are perfectly able to rejoin the cluster whatever event occurs, thanks to the Docker Swarm {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for Docker Swarm to restart it, force-update it in order to force a restart, scale the service to 0 and back to 1, restart an entire server, or turn all 3 servers off and on again. I have never found an issue with this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 3.11.16 (simply upgrading the official Docker image associated with the services) without issues. Thanks to a 2.2.8 snapshot on each server, I was also able to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables now have the {{me-*}} prefix.
>
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The procedure that I follow is very simple:
> # I start from the _cassandra7_ service (which is a seed node)
> # {{nodetool drain}}
> # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
> # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same one I followed for the 2.2.8 --> 3.11.16 upgrade, obviously with a different version at step 4. Unfortunately the 3.x --> 4.x upgrade does not work: the _cassandra7_ service restarts and attempts to communicate with the other seed node (_cassandra9_), but the log of _cassandra7_ shows the following:
> {code:java}
> INFO [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step 4, so only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is always the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot and finally perform step 4 using the official 3.11.16 image, node 7 restarts correctly and joins the cluster. I attached the relevant part of the log (see _cassandra7.downgrade.log_) where you can see that nodes 7 and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) supporting both encrypted and unencrypted traffic. As stated previously, I'm using the untouched official Cassandra images, so my whole cluster, inside the Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 steps above for the _cassandra9_ and _cassandra8_ services as well, in the end the cluster works.
> But this is not acceptable, because the cluster stays unavailable until I finish the full upgrade of all the nodes: I need to perform a rolling, node-by-node upgrade, where only 1 node is temporarily down and the other N-1 stay up.
> Any idea on how to further investigate the issue? Thanks
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)