Hi,

I wanted to discuss the online upgrade procedure from 4.x to 5.x, which
increased the number of rolling restarts required from 1 to 3, making the
upgrade more cumbersome for operators.

The main reason for this change, as far as I understand, is to support
larger TTLs. To give some context, CASSANDRA-14092 capped the maximum TTL
expiration date to 2038, which is the maximum deletionTime that can be
represented in a signed 32-bit integer (version -na-). CASSANDRA-14227
expanded the maximum expiration date to 2106 by updating the storage format
to use an unsigned integer to represent deletionTime (version -nc-).
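
For reference, the 2038 and 2106 limits follow directly from the integer
width. The snippet below is only a back-of-the-envelope check of the raw
32-bit limits, assuming deletionTime is stored as whole seconds since the
Unix epoch; the exact caps in Cassandra may be slightly lower if some values
are reserved as sentinels.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def max_expiration(bits: int, signed: bool) -> datetime:
    """Latest UTC instant representable as whole seconds since the epoch."""
    max_seconds = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return EPOCH + timedelta(seconds=max_seconds)


signed_limit = max_expiration(32, signed=True)
unsigned_limit = max_expiration(32, signed=False)

print(signed_limit.isoformat())    # 2038-01-19T03:14:07+00:00
print(unsigned_limit.isoformat())  # 2106-02-07T06:28:15+00:00
```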

In order to support a seamless upgrade from 4.x (maxExpirationDate=2038) to
5.x (maxExpirationDate=2106), the upgrade procedure described in [1][2]
suggests the following steps:
1) Rolling restart the cluster with storage_compatibility_mode=CASSANDRA_4.
At this point, maxExpirationDate=2038.
2) Rolling restart the cluster with storage_compatibility_mode=UPGRADING.
At this point, maxExpirationDate is 2038 before all nodes are upgraded, and
maxExpirationDate=2106 after all nodes are deemed upgraded.
3) Rolling restart the cluster with storage_compatibility_mode=NONE. At
this point, maxExpirationDate=2106.
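
To make the three steps concrete, here is a small model of how I read the
current behavior. It is a sketch of my understanding, not the actual
implementation: the Mode enum mirrors the storage_compatibility_mode values,
while max_expiration_year and the all_nodes_upgraded flag are names I made
up for illustration.

```python
from enum import Enum


class Mode(Enum):
    """Values of storage_compatibility_mode discussed in this thread."""
    CASSANDRA_4 = "CASSANDRA_4"
    UPGRADING = "UPGRADING"
    NONE = "NONE"


def max_expiration_year(mode: Mode, all_nodes_upgraded: bool) -> int:
    """Hypothetical helper: the year cap a node applies to TTL expiration."""
    if mode is Mode.CASSANDRA_4:
        # Step 1: always behave like 4.x, regardless of the peers.
        return 2038
    if mode is Mode.UPGRADING:
        # Step 2: the cap depends on a runtime check of the peers.
        return 2106 if all_nodes_upgraded else 2038
    # Step 3: no compatibility checks at all.
    return 2106


for mode in Mode:
    for upgraded in (False, True):
        print(mode.value, upgraded, max_expiration_year(mode, upgraded))
```

Only the UPGRADING step depends on the state of the other nodes, which is
the version check that [2] mentions.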

In my understanding, users are encouraged to start in
storage_compatibility_mode=CASSANDRA_4 for 2 reasons:
A) Allow rollback to Cassandra 4 if something goes wrong during an upgrade,
decoupling the binary upgrade from the storage version upgrade, allowing
users to build confidence in the binary upgrade before doing the storage
version upgrade, where higher TTLs are supported.
B) During mixed mode, prevent a streaming or write operation with a higher
TTL from being sent to a node on 4.0, which does not support it yet.

When the node moves to storage_compatibility_mode=UPGRADING, the node's
storage format changes to the 5.0 format and a rollback to 4 is no longer
possible, but it still prevents sending a higher TTL to a node which is
already on 5.0 but still in storage_compatibility_mode=CASSANDRA_4.

I'm uncertain about the requirement for the third rolling restart to bring
storage_compatibility_mode to NONE. The main reason given in [2] is:
> This eliminates the cost of checking node versions and ensures stability.
> If Cassandra was started at the previous version by accident, a node with
> disabled compatibility mode would no longer toggle behaviors as when it was
> running in the UPGRADING mode.

I believe the cost of checking versions [3] is negligible and does not
justify a third restart. Regarding the stability of the storage
compatibility mode, I think we can address this by persisting the storage
version in a system table to ensure that once a node goes to storage
version 5 it can no longer switch back to 4.

I think the upgrade instructions added by CASSANDRA-14227 conflated the
downgradability of storage with the increase in maximum supported TTL, which
may put an unnecessary burden on operators by requiring 3 restarts.

I'd like to propose simplifying the upgrade instructions to the following:
1) If you'd like to be able to downgrade to 4.0 seamlessly, start with
storage_compatibility_mode=CASSANDRA_4. Once you are confident with
Cassandra 5.0, do a rolling restart with storage_compatibility_mode=NONE.
Two restarts are needed and there is no UPGRADING step.
2) If you are starting on 5.0 or are confident with the 5.0 storage format,
start with storage_compatibility_mode=NONE. A single restart is needed and
downgrade is not supported.

In order to support this, a new field storage_version would be added to the
system.local table. When storage_compatibility_mode=NONE and all peers are
on 5.0, this field would be populated with 5. Support for TTLs beyond 2038
would be gated on this field.
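
A rough sketch of what I have in mind follows. Everything in it is
hypothetical: LocalState stands in for the proposed storage_version field,
on_startup and max_expiration_year are illustrative names, and refusing to
start in CASSANDRA_4 mode after the version has been persisted is just one
possible way to enforce that a node cannot switch back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalState:
    """Stand-in for the proposed storage_version field (hypothetical)."""
    storage_version: Optional[int] = None


def on_startup(state: LocalState, mode: str, all_peers_on_5: bool) -> LocalState:
    """Apply the proposed rules when a node starts with the given mode."""
    if state.storage_version == 5 and mode == "CASSANDRA_4":
        # Assumption: once version 5 is persisted, going back is rejected.
        raise RuntimeError("storage_version=5 is persisted; cannot run in CASSANDRA_4 mode")
    if mode == "NONE" and all_peers_on_5:
        # Persist the new storage version only when the whole cluster is on 5.0.
        state.storage_version = 5
    return state


def max_expiration_year(state: LocalState) -> int:
    """TTLs beyond 2038 are gated on the persisted storage version."""
    return 2106 if state.storage_version == 5 else 2038


state = LocalState()
on_startup(state, "NONE", all_peers_on_5=False)
print(max_expiration_year(state))  # 2038, some peers are still on 4.x
on_startup(state, "NONE", all_peers_on_5=True)
print(max_expiration_year(state))  # 2106, storage_version is now persisted
```

With this, the per-write peer version check is replaced by a read of a
locally persisted value, and the UPGRADING mode is no longer needed to
decide when higher TTLs are safe.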

Please let me know what you think and if you think it is worth pursuing
this effort to simplify the upgrade to 5.x.

Thanks,

Paulo

[1] -
https://github.com/apache/cassandra/blob/cassandra-5.0/NEWS.txt#L15-L21
[2] -
https://github.com/apache/cassandra/blob/cassandra-5.0/conf/cassandra.yaml#L2275-L2281
[3] -
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L97
