[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794449#comment-17794449
 ] 

Brandon Williams commented on CASSANDRA-19178:
----------------------------------------------

One way out may be to add a new node to the cluster that knows about both
cassandra7 and cassandra9 and can "introduce" those nodes to each other once it
has learned their correct addresses.  It may not even need to complete
bootstrapping for this to happen.
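
A rough sketch of what such an "introducer" service could look like in the same
swarm stack, reusing the environment variables from the report (the cassandra10
service name below is purely hypothetical):

{code:java}
# hypothetical new service "cassandra10" added to the existing stack
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra10"
# list both of the nodes that can no longer reach each other as seeds,
# so the new node gossips with each of them and relays their current addresses
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}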

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -------------------------------------------------
>
>                 Key: CASSANDRA-19178
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Aldo
>            Priority: Normal
>         Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker Swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services run version 3.11.16, using the official Cassandra 
> 3.11.16 image from Docker Hub. The first service is configured with just the 
> following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which the image entrypoint uses at startup to modify {_}cassandra.yaml{_}. So 
> for instance the _cassandra.yaml_ for the first service contains the following 
> (the rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
> {{tasks.cassandra9}} respectively).
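> For illustration, a stack definition along these lines reproduces the setup; 
> the overlay network name here is just a placeholder, and I am assuming that 
> services 8 and 9 use the same seed list as service 7:
> {code:java}
> version: "3.8"
> services:
>   cassandra7:
>     image: cassandra:3.11.16
>     environment:
>       CASSANDRA_LISTEN_ADDRESS: "tasks.cassandra7"
>       CASSANDRA_SEEDS: "tasks.cassandra7,tasks.cassandra9"
>     networks:
>       - cassandra_net   # placeholder overlay network name
>   cassandra9:
>     image: cassandra:3.11.16
>     environment:
>       CASSANDRA_LISTEN_ADDRESS: "tasks.cassandra9"
>       CASSANDRA_SEEDS: "tasks.cassandra7,tasks.cassandra9"
>     networks:
>       - cassandra_net
>   # cassandra8 is declared the same way, with CASSANDRA_LISTEN_ADDRESS=tasks.cassandra8
> networks:
>   cassandra_net:
>     driver: overlay {code}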
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whatever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for 
> Docker Swarm to restart it, force-update it in order to force a restart, 
> scale the service to 0 and then back to 1, restart an entire server, or turn 
> all 3 servers off and back on. I have never found an issue with any of this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the official Docker image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> now have the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple (a command-level sketch follows the 
> list):
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
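> In terms of concrete commands, on the server hosting the node this is roughly 
> what I run (looking the container up by service name is just a convenience, 
> and the image tag is the version I am upgrading to):
> {code:java}
> # step 2: drain the node inside the running 3.11.16 container
> docker exec $(docker ps -q -f name=cassandra7) nodetool drain
> # step 3: wait for the DRAINING ... DRAINED messages
> docker service logs --tail 100 cassandra7 | grep -i drain
> # step 4: point the service at the 4.x image; Swarm restarts the task
> docker service update --image cassandra:4.1.3 cassandra7 {code}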
> The procedure is exactly the same as the one I followed for the upgrade 2.2.8 
> --> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working: the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}), but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, 
> is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step #4, 
> so only _cassandra7_ says anything in its logs.
> I tried multiple versions (4.0.11 but also 4.0.0) but the outcome is always 
> the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot 
> and finally perform step #4 using the official 3.11.16 image, node 7 restarts 
> correctly and joins the cluster. I attached the relevant part of that log 
> (see {_}cassandra7.downgrade.log{_}), where you can see that nodes 7 and 9 
> can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously, I'm 
> using the untouched official Cassandra images, so my whole cluster, inside 
> the Docker Swarm, is not (and has never been) configured with encryption.
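> A quick way to double-check that on the upgraded node is to read the 
> encryption block straight out of the container (same config path as in the 
> grep above):
> {code:java}
> # confirm internode encryption is still at the image defaults
> docker exec $(docker ps -q -f name=cassandra7) \
>   grep -A 5 '^server_encryption_options' /etc/cassandra/cassandra.yaml {code}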
> I can also add the following: if I perform the 4 steps above for the 
> _cassandra9_ and _cassandra8_ services as well, in the end the cluster works. 
> But this is not acceptable, because the cluster is unavailable until I finish 
> the full upgrade of all nodes: I need to perform a rolling upgrade, one node 
> after the other, where only 1 node is temporarily down and the other N-1 stay 
> up.
> Any idea on how to further investigate the issue? Thanks
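> In case it is useful, these are the kinds of checks I can run from inside the 
> containers to gather more data (the bash /dev/tcp redirection is just a way 
> to test the port without installing netcat in the image):
> {code:java}
> # what tasks.cassandra9 currently resolves to on the overlay network
> docker exec $(docker ps -q -f name=cassandra7) getent hosts tasks.cassandra9
> # what cassandra7 currently believes about its peers
> docker exec $(docker ps -q -f name=cassandra7) nodetool gossipinfo
> docker exec $(docker ps -q -f name=cassandra7) nodetool describecluster
> # raw TCP check on the storage port
> docker exec $(docker ps -q -f name=cassandra7) bash -c \
>   'echo > /dev/tcp/tasks.cassandra9/7000 && echo port 7000 open' {code}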
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
