rstest created CASSANDRA-21446:
----------------------------------
Summary: DDL during 4.1 → 5.0 rolling upgrade can leave a table
without column rows; node fails startup with MissingColumns
Key: CASSANDRA-21446
URL: https://issues.apache.org/jira/browse/CASSANDRA-21446
Project: Apache Cassandra
Issue Type: Bug
Components: Cluster/Schema
Reporter: rstest
h2. Summary
A 4.1 -> 5.0 rolling upgrade deterministically bricks the not-yet-upgraded
node: column_masks content makes column-bearing schema pushes undecodable on
4.1, a surviving option-only ALTER push plants a table row without columns, and
the node then fails startup with SchemaKeyspace$MissingColumns.
h2. Description
During a 4.1.10 -> 5.0.6 rolling upgrade with
{{storage_compatibility_mode: CASSANDRA_4}}, ordinary DDL executed through the
already-upgraded node
reliably leaves the still-old node with a {{system_schema.tables}} row that has
*no* {{system_schema.columns}} rows. The old node logs
{{SchemaKeyspace$MissingColumns}} on {{MigrationStage}} while still running,
and when the rolling upgrade then restarts that node on 5.0.6, startup fails in
{{Schema.loadFromDisk()}} with the same {{MissingColumns}} and the daemon
crash-loops indefinitely. The rolling upgrade is stuck mid-flight; recovery
requires {{-Dcassandra.ignore_corrupted_schema_tables=true}} or manually
deleting rows from {{system_schema.tables}}/{{system_schema.columns}}
(as the error message itself suggests).
Three independent behaviors compose into this outcome. We describe each because
they likely warrant separate fixes.
h3. (1) Every column-bearing schema push from 5.0 is undecodable on 4.1 because of column_masks
{{system_schema.column_masks}} was introduced in 5.0 (Dynamic Data Masking /
CEP-20) and *does not exist on 4.1.10*. In 5.0.6,
{{SchemaKeyspace.addColumnToSchemaMutation()}} (SchemaKeyspace.java:712)
touches {{ColumnMasks}} for *every* serialized user-table column,
*even when the column has no mask* (it writes a delete-marker via
{{maskBuilder.delete()}}). Consequently every schema mutation that
serializes columns - {{CREATE TABLE}}, {{ALTER ... ADD}}, PK {{RENAME}} -
embeds a partition update for table id
{{738cc5ed-0168-3268-b9d1-853d4bc278af}} ({{system_schema.column_masks}}).
A 4.1.10 receiver deserializes a pushed schema mutation as a unit: it hits the
unknown table id, throws {{UnknownTableException}}, and
*the entire push is dropped* - the {{tables}} row and all {{columns}} rows are
lost together with the masks rows. The drop is silent on the sender; nothing
retries.
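The all-or-nothing decode described above can be illustrated with a small toy model. This is a sketch in Python, not Cassandra code: {{decode_push}}, {{receive}}, {{KNOWN_TABLES_41}} and the tuple-shaped partition updates are hypothetical stand-ins, and only the table names and the {{UnknownTableException}} behavior are taken from this report.

```python
# Toy model of unit deserialization on a 4.1 receiver. Not Cassandra code;
# every name below is a hypothetical stand-in used for illustration only.

class UnknownTableException(Exception):
    pass

# Schema tables the modeled 4.1 node knows about; column_masks is 5.0-only.
KNOWN_TABLES_41 = {"tables", "columns"}

def decode_push(partition_updates, known_tables):
    """Decode a pushed schema mutation as a unit: one unknown table fails all."""
    decoded = []
    for table, row in partition_updates:
        if table not in known_tables:
            raise UnknownTableException(table)
        decoded.append((table, row))
    return decoded

def receive(partition_updates, known_tables):
    """Return the rows that would be applied; a failed decode applies nothing."""
    try:
        return decode_push(partition_updates, known_tables)
    except UnknownTableException:
        return []  # whole push dropped, including the rows that were decodable

# A column-bearing push always carries a column_masks update (modeled here).
create_push = [
    ("tables", "ks.v"),
    ("columns", "ks.v.brvqnt"),
    ("column_masks", "ks.v.brvqnt delete-marker"),
]
# An option-only ALTER carries only the table row.
option_only_alter_push = [("tables", "ks.v")]

applied_from_create = receive(create_push, KNOWN_TABLES_41)
applied_from_alter = receive(option_only_alter_push, KNOWN_TABLES_41)
```

In this model the CREATE push contributes nothing, while the option-only ALTER push lands its bare table row.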
Notably, the code already acknowledges this exact hazard: the
{{isReplicatedSystemKeyspace}} branch in {{addColumnToSchemaMutation}} avoids
masks on distributed system keyspaces because "old nodes without DDM ... won't
know what to do with the mask mutations". The same consideration is not applied
to user-table schema pushes during the mixed-version window.
h3. (2) With storage_compatibility_mode: CASSANDRA_4, schema push works but pull is blocked - so nothing can repair the old node
The push gate ({{MigrationCoordinator.shouldPushSchemaTo}}) checks only raw
messaging-version equality. Under {{CASSANDRA_4}} compatibility mode, 5.0.6
runs {{current_version = VERSION_40}} (v12, MessagingService.java:257), equal
to 4.1.10's - so pushes *do* flow from the 5.0 node to the 4.1 node.
The pull gate ({{MigrationCoordinator.shouldPullFromEndpoint}})
additionally requires the peer's release *major* to match ("Not pulling schema
from ... because release version ... is not major version ..."), per
CASSANDRA-13274's deliberate "no schema exchange across major versions"
restriction. So the 4.1 node can never pull the full (self-contained) schema
from the 5.0 node.
The combination makes the schema channel push-only and fire-and-forget: with
(1) dropping every push that carries column rows, there is *no mechanism at
all* by which the column rows can reach the old node before it is upgraded.
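The asymmetry between the two gates can be written down as a toy model. This is a sketch in Python under simplifying assumptions, not the actual {{MigrationCoordinator}} logic: the {{Node}} type and both functions are hypothetical, and any other conditions the real gates evaluate are omitted.

```python
# Toy model of the push and pull gates as described in this report.
# Hypothetical names; the real gates may check additional conditions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    release_major: int      # 4 for 4.1.10, 5 for 5.0.6
    messaging_version: int  # 12 for VERSION_40

def should_push_schema_to(sender: Node, peer: Node) -> bool:
    # Push gate: raw messaging-version equality only.
    return sender.messaging_version == peer.messaging_version

def should_pull_from(puller: Node, peer: Node) -> bool:
    # Pull gate: additionally requires the release major to match.
    return (puller.messaging_version == peer.messaging_version
            and puller.release_major == peer.release_major)

old_node = Node(release_major=4, messaging_version=12)
# 5.0.6 under CASSANDRA_4 compatibility mode speaks messaging version 12.
new_node = Node(release_major=5, messaging_version=12)

push_allowed = should_push_schema_to(new_node, old_node)
pull_allowed = should_pull_from(old_node, new_node)
```

With these inputs the push is allowed and the pull is refused, which is the push-only channel described above.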
h3. (3) ALTER mutations are not self-contained, and the receiver persists before validating
{{SchemaKeyspace.addAlterTableToSchemaMutation()}} (SchemaKeyspace.java:607)
calls {{addTableToSchemaMutation(newTable, false)}} - a full
{{system_schema.tables}} row rewrite with {{deletePrevious()}} - and then
serializes only *changed* columns. An option-only ALTER (e.g.
{{WITH speculative_retry = ...}}) has an empty column diff, so its mutation is
a bare table row with *no* column updates - and therefore
*no column_masks content*, which means it is the one push the 4.1 node *can*
decode.
The 4.1 receiver ({{DefaultSchemaUpdateHandler.applyMutations}}) writes
received mutations into local {{system_schema}} first and re-reads the keyspace
afterwards. The re-read ({{fetchColumns}}, SchemaKeyspace.java:1076 in 5.0
/ :987 in 4.1) throws {{MissingColumns}} when the table has zero column rows -
but by then the invalid state is already persisted on disk. On a running node
the throw only aborts the in-memory update; at startup
({{CassandraDaemon.setup -> Schema.loadFromDisk -> ... -> fetchColumns}})
it is fatal, and {{fetchTables}} rethrows unless
{{cassandra.ignore_corrupted_schema_tables=true}}.
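The persist-then-validate ordering can be sketched as a toy model. This is illustrative Python, not the {{DefaultSchemaUpdateHandler}} code: the dict-shaped "disk", the mutation layout and all function names are hypothetical; only the {{MissingColumns}} condition (a table row with zero column rows) is taken from this report.

```python
# Toy model of "write first, re-read afterwards". Not Cassandra code;
# the disk layout and function names are hypothetical.

class MissingColumns(Exception):
    pass

def fetch_columns(disk, table):
    cols = disk["columns"].get(table, [])
    if not cols:
        raise MissingColumns(f"Columns not found in schema table for {table}")
    return cols

def load_from_disk(disk):
    # Re-read: every table row must have at least one column row.
    return {table: fetch_columns(disk, table) for table in disk["tables"]}

def apply_mutations(disk, mutation):
    # Step 1: persist the received mutation.
    disk["tables"].update(mutation.get("tables", ()))
    for table, cols in mutation.get("columns", {}).items():
        disk["columns"].setdefault(table, []).extend(cols)
    # Step 2: validate by re-reading. Too late to undo step 1.
    return load_from_disk(disk)

def boots(disk):
    try:
        load_from_disk(disk)
        return True
    except MissingColumns:
        return False

disk = {"tables": set(), "columns": {}}
option_only_alter = {"tables": {"ks.v"}}  # bare table row, no column rows

try:
    apply_mutations(disk, option_only_alter)
    rejected = False
except MissingColumns:
    rejected = True

table_row_persisted = "ks.v" in disk["tables"]
restart_succeeds = boots(disk)
```

In the model the update is rejected while the node is running, yet the table row stays on the modeled disk and the next load fails the same way.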
h3. Net effect - why the upgrade is stuck every time
The poisoned node is exactly the node the rolling upgrade restarts next; both
repair channels (column-bearing pushes, cross-major pull) are sealed before
that restart; therefore the corrupt on-disk state is stable until the upgrade
detonates it. No races, no faults: the sequence CREATE TABLE (via the new node)
followed by any column-free table-row rewrite reaching the old node produces
the brick deterministically. The supporting evidence matches exactly: the old
node ends with the "Columns not found" form of MissingColumns (zero column rows
- even a later ALTER ADD's column is absent, because that push was dropped too).
h2. Steps to Reproduce
Two nodes on 4.1.10; 5.0.6 binaries with
{{storage_compatibility_mode: CASSANDRA_4}}.
{code:sql}
-- step 1: both nodes on 4.1.10
CREATE KEYSPACE ks WITH replication =
{'class':'SimpleStrategy','replication_factor':2};
{code}
# Rolling-upgrade node 1 to 5.0.6 (drain, stop, swap binaries, start). Cluster
is now mixed 5.0/4.1.
# Through *node 1* (the 5.0.6 node):
{code:sql}
CREATE TABLE ks.v (cs text, brvqnt int, qzctg set<int>, PRIMARY KEY (brvqnt))
WITH speculative_retry = '90MS';
-- option-only ALTER: the only decodable push
ALTER TABLE ks.v WITH speculative_retry = 'ALWAYS';
ALTER TABLE ks.v ADD ck set<int>;
{code}
# Observe node 2 (4.1.10) log: {{ERROR [MigrationStage:1] ...
SchemaKeyspace$MissingColumns: Columns not found in schema table for ks.v}}
(stack: {{fetchColumns -> fetchTable -> fetchTables}}). Node 2 keeps
running. {{SELECT * FROM system_schema.columns WHERE keyspace_name='ks' AND
table_name='v'}} on node 2 returns no rows while {{system_schema.tables}} has
the row.
# Rolling-upgrade node 2 to 5.0.6.
Expected: node 2 starts on 5.0.6 with the full schema (or the mixed-window DDL
is rejected or propagated safely).
Actual: node 2 crash-loops:
{noformat}
ERROR [main] CassandraDaemon.java:287 - Error while loading schema:
org.apache.cassandra.schema.SchemaKeyspace$MissingColumns: Columns not found in
schema table for ks.v
at
org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:1081)
at
org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:1032)
at
org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:991)
...
at org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:155)
at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:283)
ERROR [main] CassandraDaemon.java:887 - Exception encountered during startup
{noformat}
The daemon exits and repeats identically on every restart (~9 s cycle
observed). The rolling upgrade cannot proceed.
We reproduced the same outcome with other second-statement shapes
({{CREATE INDEX}}; {{CREATE}} immediately followed by a parameter-only
{{ALTER}});
the requirement is only that a column-free table-row rewrite reaches the old
node after the column-bearing CREATE push was dropped. We can provide recorded
deterministic sequences and full per-lane logs for three independent
reproductions.
h2. Root cause code pointers
* {{SchemaKeyspace.addColumnToSchemaMutation}} (5.0.6
SchemaKeyspace.java:712) - unconditionally updates {{ColumnMasks}} for every
user-table column (delete-marker when {{mask == null}}), making every
column-bearing schema push undecodable by 4.1.
* {{SchemaKeyspace.addAlterTableToSchemaMutation}} (5.0.6
SchemaKeyspace.java:607) - {{addTableToSchemaMutation(newTable, false)}}
rewrites the table row with {{deletePrevious()}} but serializes only changed
columns; option-only ALTERs produce a bare, decodable table row.
* {{MigrationCoordinator.shouldPushSchemaTo}} (:769/:775) -
messaging-version-equality only; passes under {{CASSANDRA_4}} compatibility
mode ({{MessagingService.java:257}}).
* {{MigrationCoordinator.shouldPullFromEndpoint}} (:353) - cross-major pull
refusal (CASSANDRA-13274); removes the repair path.
* {{DefaultSchemaUpdateHandler.applyMutations}} (4.1.10 :201-214, 5.0.6
:273-288) - persists received mutations before the re-read that detects the
violation.
* {{SchemaKeyspace.fetchColumns}} (5.0.6 :1076, throws :1081/:1087) and
{{fetchTables}} rethrow-unless-{{ignore_corrupted_schema_tables}} - fatal
at startup via {{Schema.loadFromDisk}} ({{CassandraDaemon.setup}}).
h2. Suggested fixes
Fixing any one layer breaks the chain; (1) is the most targeted:
# *Stop poisoning pushes:* omit the {{column_masks}} delete-marker when a
column has no mask, or version-gate 5.0-only schema-table content out of pushes
to pre-5.0 peers (the {{isReplicatedSystemKeyspace}} branch shows the pattern
already exists). With decodable CREATE pushes, the old node receives complete
schema and this scenario disappears.
# *Make table-alter mutations safe for receivers without the base schema:*
include primary-key column rows whenever the table row is rewritten, or have
the receiver refuse to apply an alter for a table with no local column rows
(triggering a repair instead).
# *Validate before persisting:* {{applyMutations}} should check the post-state
(every table row has columns, including a partition key) before flushing to
local {{system_schema}}, rather than persisting a state that is fatal at next
boot.
# *Repair at startup:* when {{loadFromDisk}} encounters table-without-columns
and peers exist, attempt a schema pull/repair before treating it as
unrecoverable corruption (today's options - ignore-flag or manual DELETEs from
{{system_schema}} - are operator surgery on a mid-upgrade cluster).
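The "validate before persisting" option can be sketched as a toy model. This is illustrative Python under stated assumptions, not a proposed patch: the dict-shaped state and all function names are hypothetical, and the check is simplified to "every table row has at least one column row" rather than also verifying a partition key.

```python
# Toy sketch of validate-before-persist. Not Cassandra code; hypothetical
# names, and the validation is deliberately simplified.
import copy

class MissingColumns(Exception):
    pass

def merged(state, mutation):
    """Compute the post-state on a copy, leaving the persisted state untouched."""
    post = copy.deepcopy(state)
    post["tables"] |= set(mutation.get("tables", ()))
    for table, cols in mutation.get("columns", {}).items():
        post["columns"].setdefault(table, []).extend(cols)
    return post

def validate(state):
    for table in state["tables"]:
        if not state["columns"].get(table):
            raise MissingColumns(f"Columns not found in schema table for {table}")

def apply_validated(disk, mutation):
    """Persist only if the post-state is loadable; return whether it was applied."""
    post = merged(disk, mutation)
    try:
        validate(post)
    except MissingColumns:
        return False  # nothing written; a repair could be triggered here
    disk.clear()
    disk.update(post)
    return True

disk = {"tables": set(), "columns": {}}
alter_applied = apply_validated(disk, {"tables": {"ks.v"}})
state_after_alter = copy.deepcopy(disk)
create_applied = apply_validated(
    disk, {"tables": {"ks.v"}, "columns": {"ks.v": ["brvqnt", "cs", "qzctg"]}}
)
```

In this sketch the bare table row is refused without touching the modeled disk, while a complete CREATE is applied normally.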
h2. Additional context
* The same end state (table row without columns -> fatal restart) can also be
reached without any version skew when a node misses a CREATE during a network
partition and later receives a diff-only ALTER push; we focus this report on
the rolling-upgrade path because it is deterministic and blocks the documented
upgrade procedure.
* Related: CASSANDRA-13274 (cross-major schema exchange restriction) is what
removes the pull-based repair path. The {{column_masks}} push-deserialization
issue (schema change lost, {{UnknownTableException}} for table id
{{738cc5ed-0168-3268-b9d1-853d4bc278af}} on the 4.1 peer) was also observed
independently; we can file it as a separate report if preferred, since it is
ingredient (1) of this one.