[ https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794439#comment-17794439 ]

Brandon Williams commented on CASSANDRA-19178:
----------------------------------------------

Yes, you can make the seed provider reload the seeds with 'nodetool reloadseeds'.
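
For example (standard nodetool invocation; add -h/-p if JMX is not on the
local default port):
{code:java}
# ask the configured seed provider to re-read and re-resolve its seed list
nodetool reloadseeds
{code}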

 

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -------------------------------------------------
>
>                 Key: CASSANDRA-19178
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Aldo
>            Priority: Normal
>         Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker Swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services run version 3.11.16, using the official Cassandra 
> 3.11.16 image from Docker Hub. The first service is configured with just 
> the following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which the image's entrypoint uses at startup to modify {_}cassandra.yaml{_}. 
> So for instance the _cassandra.yaml_ for the first service contains the 
> following (and the rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
> {{tasks.cassandra9}}).
> The cluster runs smoothly and all the nodes are perfectly able to rejoin 
> the cluster whatever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait 
> for Docker Swarm to restart it, force-update it to trigger a restart, scale 
> the service to 0 and then back to 1, restart an entire server, or turn all 
> 3 servers off and back on. I have never found an issue with any of this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply by upgrading the official Docker image associated with the 
> services) without issues. Thanks to a 2.2.8 snapshot on each server, I was 
> also able to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> now have the {{me-*}} prefix.
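> A quick way to double-check that rewrite on a node (assuming the 
> image-default data directory {{/var/lib/cassandra/data}}):
> {code:java}
> # rewrite any SSTables still on an older on-disk format
> nodetool upgradesstables
> # the rewritten files carry the me- version prefix
> find /var/lib/cassandra/data -name 'me-*-Data.db' | head
> {code}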
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version 
> (see the sketch below)
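> As a minimal sketch of steps 2-4 (the service name is from this report; the 
> image tag is an example):
> {code:java}
> # step 2: flush memtables and stop accepting client and internode traffic
> docker exec $(docker ps -qf name=cassandra7) nodetool drain
> # step 3: watch the service log for DRAINING ... DRAINED
> docker service logs --since 5m cassandra7 | grep -E 'DRAINING|DRAINED'
> # step 4: swap the image; Swarm restarts the task on the new version
> docker service update --image cassandra:4.1.3 cassandra7
> {code}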
> The procedure is exactly the same one I followed for the 2.2.8 --> 3.11.16 
> upgrade, obviously with a different version at step 4. Unfortunately the 
> 3.x --> 4.x upgrade does not work: the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}), but 
> the log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant part of the log, related to the missing internode 
> communication, is attached as _cassandra7.log_.
> The log of _cassandra9_ shows nothing after the above-mentioned step 4, so 
> only _cassandra7_ says anything in the logs.
> I tried multiple versions (4.0.11 but also 4.0.0) and the outcome is always 
> the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot 
> and finally perform step 4 using the official 3.11.16 image, node 7 
> restarts correctly and joins the cluster. I attached the relevant part of 
> the log (see {_}cassandra7.downgrade.log{_}), where you can see that nodes 
> 7 and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously, 
> I'm using the untouched official Cassandra images, so my whole cluster, 
> inside the Docker Swarm, is not (and has never been) configured with 
> encryption.
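> One way to probe this from the _cassandra7_ host (hostnames are from this 
> report; nc and openssl are assumed to be available):
> {code:java}
> # raw TCP reachability of the peer's storage port
> nc -vz tasks.cassandra9 7000
> # see whether the peer completes a TLS handshake on 7000; against an
> # unencrypted-only 3.x peer this should fail rather than negotiate
> openssl s_client -connect tasks.cassandra9:7000 </dev/null
> {code}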
> I can also add the following: if I perform the 4 steps above for the 
> _cassandra9_ and _cassandra8_ services as well, the cluster works in the 
> end. But this is not acceptable, because the cluster is unavailable until I 
> finish the full upgrade of all nodes: I need to perform a rolling upgrade, 
> one node after the other, where only 1 node is temporarily down and the 
> other N-1 stay up.
> Any idea on how to further investigate the issue? Thanks
>  



