Re: Potential issues during 4.0 upgrade

2021-08-23 Thread Scott Andreas
Thank you for raising this, Sam!

Agreed this is a bug that warrants releasing 4.0.1 and notifying user@.

To elaborate on impact: in rolling 3.x -> 4.0 upgrades, this issue can produce 
a state in which 4.0 nodes fail to serialize gossip state during the shadow 
round once the size of that state exceeds 128kb. Once a cluster is in this 
state, new instances cannot start up and join the ring. If existing 4.0 
instances restart, they will also be unable to gossip and will remain down.
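
As a rough illustration of how cluster size relates to that threshold, here is 
a back-of-the-envelope sketch. It is not Cassandra code: the function names 
and the per-endpoint byte figure are hypothetical, and it assumes that "128kb" 
means 128 KiB and that serialized gossip state grows roughly linearly with 
node count.

```python
# Hypothetical estimate only; none of these names exist in Cassandra.
# Assumption: the 128kb limit mentioned above is 128 KiB.
URGENT_LIMIT_BYTES = 128 * 1024


def estimated_gossip_state_bytes(node_count: int, bytes_per_endpoint: int) -> int:
    """Estimate serialized gossip state size, assuming linear growth per endpoint."""
    return node_count * bytes_per_endpoint


def exceeds_urgent_limit(node_count: int, bytes_per_endpoint: int,
                         limit: int = URGENT_LIMIT_BYTES) -> bool:
    """Return True if the estimated state would not fit under the limit."""
    return estimated_gossip_state_bytes(node_count, bytes_per_endpoint) > limit


def max_nodes_under_limit(bytes_per_endpoint: int,
                          limit: int = URGENT_LIMIT_BYTES) -> int:
    """Largest node count whose estimated state still fits under the limit."""
    return limit // bytes_per_endpoint


if __name__ == "__main__":
    # 400 bytes per endpoint is an illustrative guess, not a measured value.
    for nodes in (100, 300, 400):
        print(nodes, exceeds_urgent_limit(nodes, 400))
```

The real per-endpoint size depends on the application state each node 
gossips, so the value would need to be measured on an actual cluster before 
drawing any conclusions.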

It's a pretty serious situation without an obvious way out aside from deploying 
this patch. We should get a new release out quickly.

– Scott


From: Sam Tunnicliffe 
Sent: Monday, August 23, 2021 11:27 AM
To: dev@cassandra.apache.org
Subject: Potential issues during 4.0 upgrade

Hi all,

I just opened a JIRA which is relevant to anyone running large clusters (around 
the 400 node range) who plans to upgrade to 4.0 soon.

https://issues.apache.org/jira/browse/CASSANDRA-16877

The issue is that in large clusters, the size of gossip messages sent when a 
node (re)starts may exceed the hard limit of the urgent message channel. This 
causes an error on the sender and ultimately the message is dropped. This in 
turn can cause startup failures and/or partial loss of availability.

Fortunately, the fix is quite simple and I’ve submitted a patch. Other 
contributors and I have been running it since discovering this issue and can 
confirm that it resolves the problem. It would be great to get it reviewed and 
merged ASAP and then cut a 4.0.1 release. In the meantime, it may be wise to 
suggest that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam





Potential issues during 4.0 upgrade

2021-08-23 Thread Sam Tunnicliffe
Hi all,

I just opened a JIRA which is relevant to anyone running large clusters (around 
the 400 node range) who plans to upgrade to 4.0 soon.

https://issues.apache.org/jira/browse/CASSANDRA-16877 
 

The issue is that in large clusters, the size of gossip messages sent when a 
node (re)starts may exceed the hard limit of the urgent message channel. This 
causes an error on the sender and ultimately the message is dropped. This in 
turn can cause startup failures and/or partial loss of availability.  

Fortunately, the fix is quite simple and I’ve submitted a patch. Other 
contributors and I have been running it since discovering this issue and can 
confirm that it resolves the problem. It would be great to get it reviewed and 
merged ASAP and then cut a 4.0.1 release. In the meantime, it may be wise to 
suggest that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam