Looking over the code, the single usage that differs between UPGRADING and NONE is
org.apache.cassandra.db.rows.Cell#getVersionedMaxDeletiontionTime:

    public static long getVersionedMaxDeletiontionTime()
    {
        if (DatabaseDescriptor.getStorageCompatibilityMode().disabled())
            // The whole cluster is 2016, we're out of the 2038/2106 mixed cluster scenario. Shortcut to avoid the 'minClusterVersion' volatile read
            return Cell.MAX_DELETION_TIME;
        else
            return MessagingService.instance().versions.minClusterVersion >= MessagingService.VERSION_50
                   ? Cell.MAX_DELETION_TIME
                   : Cell.MAX_DELETION_TIME_2038_LEGACY_CAP;
    }

So in UPGRADING, each node is allowed to use the VERSION_50 messaging version, and while that is being rolled out some nodes will be on v4 and others on v5. Once a local node has learned that all peers are at least on v5, it acts the same as if you were on NONE. If you skip UPGRADING, you skip this gradual rollout of the v5 protocol.

Where I think this matters is when we send LivenessInfo / Cell back to a coordinator… Both encode as a relative time, so that feels safe (assuming you are making this change before the year 2038).

As far as I can tell, the only real difference between UPGRADING and NONE is a volatile read while constructing a Cell in SCM=UPGRADING. Given this, it does feel like we could simplify this to a single bounce and just ignore UPGRADING altogether?

> On Aug 22, 2025, at 7:51 AM, Paulo Motta <pa...@apache.org> wrote:
>
> Hi,
>
> I wanted to discuss the online upgrade procedure from 4.X to 5.x that
> increased the number of rolling restarts required from 1 to 3, making the
> upgrade procedure more cumbersome to operators.
>
> The main reason for this change as far as I understand is to support larger
> TTLs. To give some context, CASSANDRA-14092 capped the maximum TTL expiration
> date to 2038 which is the maximum deletionTime that can be represented in a
> signed integer (version -na-).
> CASSANDRA-14227 expanded the maximum
> expiration date to 2106 by updating the storage format to use an unsigned
> integer instead to represent deletionTime (version -nc-).
>
> In order to support seamless upgrade from 4.X (maxExpirationDate=2038) to 5.X
> (maxExpirationDate=2106), the upgrade procedure described in [1][2] suggests
> the following steps:
> 1) Rolling restart the cluster with storage_compatibility_mode=CASSANDRA_4.
> At this point, maxExpirationDate=2038.
> 2) Rolling restart the cluster with storage_compatibility_mode=UPGRADING. At
> this point, maxExpirationDate is 2038 before all nodes are upgraded, and
> maxExpirationDate=2106 after all nodes are deemed upgraded.
> 3) Rolling restart the cluster with storage_compatibility_mode=NONE. At this
> point, maxExpirationDate=2106.
>
> In my understanding users are encouraged to start in
> storage_compatibility_mode=4 for 2 reasons:
> A) Allow rollback to Cassandra 4 if something goes wrong during an upgrade,
> decoupling the binary upgrade from the storage version upgrade, allowing
> users to build confidence in the binary upgrade before doing the storage
> version upgrade, where higher TTLs are supported.
> B) During mixed mode, prevent a streaming or write operation with a higher
> TTL from being sent to a node in 4.0 which does not support this yet.
>
> When the node moves to storage_compatibility_mode=UPGRADING, the node's
> storage format changes to 5.0 format and a rollback to 4 is no longer
> possible, but it still prevents sending a higher TTL to a node which is
> already in 5.0 but still in storage_compatibility_mode=4.
>
> I'm uncertain about the requirement for the third rolling restart to bring
> the storage_compatibility to NONE. The main reason given in [2] is:
> > This eliminates the cost of checking node versions and ensures stability.
> > If Cassandra was started at the previous version by accident, a node with
> > disabled compatibility mode would no longer toggle behaviors as when it was
> > running in the UPGRADING mode.
>
> I believe the cost of checking versions[3] is negligible and does not justify
> a third restart. Regarding the storage compatibility mode stability, I think
> we can address this by persisting the storage version in a system table to
> ensure that once a node goes to storage version 5 it can no longer switch back
> to 4.
>
> I think the upgrade instructions added by CASSANDRA-14227 conflated
> downgradability of storage with increase of maximum supported TTL, which may
> put an unnecessary burden on operators by requiring 3 restarts.
>
> I'd like to propose simplifying the upgrade instructions to the following:
> 1) If you'd like to be able to downgrade to 4.0 seamlessly, start with
> storage_compatibility_mode=4. Once you are confident with Cassandra 5.0, do a
> rolling restart with storage_compatibility_mode=NONE, two restarts needed -
> no UPGRADING step needed.
> 2) If you are starting on 5.0 or are confident with 5.0 storage format, start
> with storage_compatibility_mode=NONE, single restart needed, no downgrade
> supported.
>
> In order to support this, a new field storage_version would be added to the
> system_local table. When storage_compatibility_mode=NONE and all peers are in
> 5.0, this field would be populated with 5. Support for TTLs beyond 2038 is
> gated on this flag.
>
> Please let me know what you think and if you think it is worth pursuing this
> effort to simplify the upgrade to 5.x.
>
> Thanks,
>
> Paulo
>
> [1] - https://github.com/apache/cassandra/blob/cassandra-5.0/NEWS.txt#L15-L21
> [2] - https://github.com/apache/cassandra/blob/cassandra-5.0/conf/cassandra.yaml#L2275-L2281
> [3] - https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L97
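
For what it's worth, the mode comparison discussed in this thread can be modelled in a few lines. This is a minimal, standalone Java sketch and not Cassandra code: the class ExpirationCapSketch, the Mode enum, the V4/V5 ordinals and both cap constants are hypothetical stand-ins. It also assumes that a node in CASSANDRA_4 mode advertises the v4 messaging version while UPGRADING and NONE advertise v5, which is my reading of the thread rather than something verified here.

```java
import java.util.List;

/** Standalone model of the deletion-time cap choice; not Cassandra code. */
final class ExpirationCapSketch
{
    /** The three storage_compatibility_mode values named in the thread. */
    enum Mode { CASSANDRA_4, UPGRADING, NONE }

    // Illustrative stand-ins for the 2038 and 2106 caps (assumed values,
    // not the real Cell constants).
    static final long CAP_2038 = Integer.MAX_VALUE; // largest signed 32-bit value
    static final long CAP_2106 = 0xFFFFFFFFL;       // largest unsigned 32-bit value

    // Hypothetical ordinals for the two messaging versions.
    static final int V4 = 4;
    static final int V5 = 5;

    /** Assumption: only CASSANDRA_4 mode keeps a node on the v4 protocol. */
    static int advertisedVersion(Mode mode)
    {
        return mode == Mode.CASSANDRA_4 ? V4 : V5;
    }

    /** Cap the local node would apply, given its own mode and its peers' modes. */
    static long maxDeletionTime(Mode localMode, List<Mode> peerModes)
    {
        // Compatibility disabled: skip the cluster-version check entirely.
        if (localMode == Mode.NONE)
            return CAP_2106;

        // Otherwise the cap follows the lowest version seen in the cluster,
        // the local node included.
        int minClusterVersion = advertisedVersion(localMode);
        for (Mode peer : peerModes)
            minClusterVersion = Math.min(minClusterVersion, advertisedVersion(peer));

        return minClusterVersion >= V5 ? CAP_2106 : CAP_2038;
    }

    public static void main(String[] args)
    {
        List<Mode> mixed = List.of(Mode.CASSANDRA_4, Mode.UPGRADING);
        List<Mode> upgraded = List.of(Mode.UPGRADING, Mode.NONE);

        System.out.println("UPGRADING, mixed peers:    " + maxDeletionTime(Mode.UPGRADING, mixed));
        System.out.println("UPGRADING, upgraded peers: " + maxDeletionTime(Mode.UPGRADING, upgraded));
        System.out.println("NONE, mixed peers:         " + maxDeletionTime(Mode.NONE, mixed));
    }
}
```

In this model UPGRADING and NONE return different caps only while some peer still advertises v4; once every peer is on v5 the two modes agree.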