rstest created CASSANDRA-21446:
----------------------------------

             Summary: DDL during 4.1 → 5.0 rolling upgrade can leave a table 
without column rows; node fails startup with MissingColumns
                 Key: CASSANDRA-21446
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21446
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Cluster/Schema
            Reporter: rstest


h2. Summary

A 4.1 -> 5.0 rolling upgrade deterministically bricks the not-yet-upgraded
node: {{column_masks}} content makes every column-bearing schema push
undecodable on 4.1, the option-only ALTER push that does get through plants a
table row without columns, and the node then fails to start with
{{SchemaKeyspace$MissingColumns}}.
h2. Description

During a 4.1.10 -> 5.0.6 rolling upgrade with
{{storage_compatibility_mode: CASSANDRA_4}}, ordinary DDL executed through the
already-upgraded node reliably leaves the still-old node with a
{{system_schema.tables}} row that has *no* {{system_schema.columns}} rows. The
old node logs {{SchemaKeyspace$MissingColumns}} on {{MigrationStage}} but keeps
running. When the rolling upgrade then restarts that node on 5.0.6, startup
fails in {{Schema.loadFromDisk()}} with the same {{MissingColumns}}, and the
daemon crash-loops indefinitely. The rolling upgrade is stuck mid-flight;
recovery requires {{-Dcassandra.ignore_corrupted_schema_tables=true}} or
manually deleting rows from {{system_schema.tables}}/{{system_schema.columns}}
(as the error message itself suggests).

Three independent behaviors compose into this outcome. We describe each because 
they likely warrant separate fixes.
h3. (1) Every column-bearing schema push from 5.0 is undecodable on 4.1 because of column_masks

{{system_schema.column_masks}} was introduced in 5.0 (Dynamic Data Masking /
CEP-20) and *does not exist on 4.1.10*. In 5.0.6,
{{SchemaKeyspace.addColumnToSchemaMutation()}} (SchemaKeyspace.java:712)
updates {{ColumnMasks}} for *every* serialized user-table column, *even when
the column has no mask* (in that case it writes a delete marker via
{{maskBuilder.delete()}}). Consequently every schema mutation that serializes
columns ({{CREATE TABLE}}, {{ALTER ... ADD}}, primary-key {{RENAME}}) embeds a
partition update for table id {{738cc5ed-0168-3268-b9d1-853d4bc278af}}
({{system_schema.column_masks}}).
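
To make the sender-side behaviour concrete, here is a deliberately simplified model, not the real {{SchemaKeyspace}} code: a schema mutation is reduced to a list of {{table:op:column}} strings, and {{ColumnMasksSenderSketch}} / {{addColumn}} are hypothetical names. It only illustrates that a column with {{mask == null}} still contributes a {{column_masks}} update.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the 5.0.6 sender side. A schema mutation is
// reduced to a list of "table:op:column" strings; this is not the real Mutation API.
final class ColumnMasksSenderSketch {

    // Mirrors the reported behaviour of addColumnToSchemaMutation for a user-table
    // column: the columns row is always written, and column_masks is always touched,
    // with a delete marker when the column has no mask.
    static void addColumn(List<String> updates, String column, String maskOrNull) {
        updates.add("columns:write:" + column);
        updates.add(maskOrNull != null
                ? "column_masks:write:" + column
                : "column_masks:delete:" + column);
    }

    static boolean touchesColumnMasks(List<String> updates) {
        return updates.stream().anyMatch(u -> u.startsWith("column_masks:"));
    }

    public static void main(String[] args) {
        // CREATE TABLE ks.v (cs text, brvqnt int, qzctg set<int>, ...): no masks at all.
        List<String> create = new ArrayList<>();
        create.add("tables:write:v");
        for (String column : List.of("cs", "brvqnt", "qzctg")) {
            addColumn(create, column, null);
        }
        System.out.println(touchesColumnMasks(create)); // true
    }
}
{code}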

A 4.1.10 receiver deserializes a pushed schema mutation as a unit: it hits the
unknown table id, throws {{UnknownTableException}}, and *the entire push is
dropped*. The {{tables}} row and all {{columns}} rows are lost together with
the masks rows. The drop is silent on the sender, and nothing retries.
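
A minimal model of the all-or-nothing decode on the 4.1.10 side, assuming a mutation can be reduced to the list of table ids it carries partition updates for. Only the {{column_masks}} id comes from this report; the ids used for {{tables}} and {{columns}} are placeholders, and {{OldNodeDecodeSketch}} is a hypothetical name.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical model of how a 4.1.10 receiver treats a pushed schema mutation:
// all partition updates are decoded together, so one unknown table id rejects all.
final class OldNodeDecodeSketch {

    // Simplified stand-in for the receiver's UnknownTableException.
    static final class UnknownTableException extends RuntimeException {
        UnknownTableException(UUID id) {
            super("unknown table id " + id);
        }
    }

    // Id of system_schema.column_masks, as reported.
    static final UUID COLUMN_MASKS_ID = UUID.fromString("738cc5ed-0168-3268-b9d1-853d4bc278af");
    // Placeholder ids for system_schema.tables / columns (not their real ids).
    static final UUID TABLES_ID = UUID.nameUUIDFromBytes("tables".getBytes(StandardCharsets.UTF_8));
    static final UUID COLUMNS_ID = UUID.nameUUIDFromBytes("columns".getBytes(StandardCharsets.UTF_8));

    // Returns the accepted updates, or throws if any table id is unknown locally;
    // in the throwing case nothing from the mutation is applied.
    static List<UUID> decode(List<UUID> updateTableIds, Set<UUID> knownTableIds) {
        for (UUID id : updateTableIds) {
            if (!knownTableIds.contains(id)) {
                throw new UnknownTableException(id);
            }
        }
        return updateTableIds;
    }

    public static void main(String[] args) {
        Set<UUID> known41 = Set.of(TABLES_ID, COLUMNS_ID); // no column_masks on 4.1
        try {
            decode(List.of(TABLES_ID, COLUMNS_ID, COLUMN_MASKS_ID), known41); // CREATE push
        } catch (UnknownTableException e) {
            System.out.println("CREATE push dropped: " + e.getMessage());
        }
        System.out.println(decode(List.of(TABLES_ID), known41).size()); // option-only ALTER: 1
    }
}
{code}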

Notably, the code already acknowledges this exact hazard: the 
{{isReplicatedSystemKeyspace}} branch in {{addColumnToSchemaMutation}} avoids 
masks on distributed system keyspaces because "old nodes without DDM ... won't 
know what to do with the mask mutations". The same consideration is not applied 
to user-table schema pushes during the mixed-version window.
h3. (2) With storage_compatibility_mode: CASSANDRA_4, schema push works but pull is blocked - so nothing can repair the old node

The push gate ({{MigrationCoordinator.shouldPushSchemaTo}}) checks only raw
messaging-version equality. Under {{CASSANDRA_4}} compatibility mode, 5.0.6
runs with {{current_version = VERSION_40}} (v12, MessagingService.java:257),
the same version as 4.1.10, so pushes *do* flow from the 5.0 node to the 4.1
node.
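
The push gate reduces to an integer comparison. In this sketch the value 12 for {{VERSION_40}} is taken from the report; {{VERSION_50 = 13}}, and the idea that it is used outside {{CASSANDRA_4}} mode, are assumptions, and {{PushGateSketch}} is a hypothetical name.

{code:java}
// Hypothetical model of the push gate described above.
final class PushGateSketch {

    // From the report: VERSION_40 is messaging version 12 (MessagingService.java:257).
    static final int VERSION_40 = 12;
    // Assumption: 5.0's own messaging version, used outside CASSANDRA_4 mode.
    static final int VERSION_50 = 13;

    // 5.0.6 under storage_compatibility_mode: CASSANDRA_4 runs with VERSION_40.
    static int currentVersion50(boolean cassandra4Mode) {
        return cassandra4Mode ? VERSION_40 : VERSION_50;
    }

    // Push gate: raw messaging-version equality; the peer's release version is not consulted.
    static boolean shouldPushSchemaTo(int localVersion, int peerVersion) {
        return localVersion == peerVersion;
    }

    public static void main(String[] args) {
        int node41 = VERSION_40; // 4.1.10 speaks version 12
        System.out.println(shouldPushSchemaTo(currentVersion50(true), node41));  // true
        System.out.println(shouldPushSchemaTo(currentVersion50(false), node41)); // false
    }
}
{code}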

The pull gate ({{MigrationCoordinator.shouldPullFromEndpoint}})
additionally requires the peer's release *major* to match ("Not pulling schema 
from ... because release version ... is not major version ..."), per 
CASSANDRA-13274's deliberate "no schema exchange across major versions" 
restriction. So the 4.1 node can never pull the full (self-contained) schema 
from the 5.0 node.
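
Compared with the push gate, the pull gate adds a release-major check. A sketch assuming the major is the first dot-separated component of the release version string ({{PullGateSketch}} and {{major}} are hypothetical names):

{code:java}
// Hypothetical model of the pull gate described above.
final class PullGateSketch {

    // Assumption: the release major is the first dot-separated component ("4.1.10" -> 4).
    static int major(String releaseVersion) {
        return Integer.parseInt(releaseVersion.split("\\.")[0]);
    }

    // Pull gate: same messaging version AND same release major (CASSANDRA-13274).
    static boolean shouldPullFromEndpoint(int localVersion, String localRelease,
                                          int peerVersion, String peerRelease) {
        return localVersion == peerVersion && major(localRelease) == major(peerRelease);
    }

    public static void main(String[] args) {
        // The 4.1.10 node considering a pull from the 5.0.6 node; both on version 12.
        System.out.println(shouldPullFromEndpoint(12, "4.1.10", 12, "5.0.6"));  // false
        System.out.println(shouldPullFromEndpoint(12, "4.1.10", 12, "4.1.10")); // true
    }
}
{code}

The asymmetry is the point: with both nodes on messaging version 12, the push gate passes while this check still refuses the pull.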

The combination makes the schema channel push-only and fire-and-forget: with 
(1) dropping every push that carries column rows, there is *no mechanism at 
all* by which the column rows can reach the old node before it is upgraded.
h3. (3) ALTER mutations are not self-contained, and the receiver persists before validating

{{SchemaKeyspace.addAlterTableToSchemaMutation()}} (SchemaKeyspace.java:607)
calls {{addTableToSchemaMutation(newTable, false)}}, a full
{{system_schema.tables}} row rewrite with {{deletePrevious()}}, and then
serializes only the *changed* columns. An option-only ALTER (e.g.
{{WITH speculative_retry = ...}}) has an empty column diff, so its mutation is
a bare table row with *no* column updates and therefore no {{column_masks}}
content. That makes it the one push the 4.1 node *can* decode.
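
The alter path can be modelled as a column diff, assuming columns are represented as a name-to-type map. {{AlterMutationSketch}} is hypothetical, column drops are left out, and the {{column_masks}} entry reflects behaviour (1) rather than anything specific to ALTER.

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of addAlterTableToSchemaMutation as described in the report:
// the tables row is always rewritten, but only added or changed columns are
// serialized (drops are omitted here for brevity).
final class AlterMutationSketch {

    // Columns are modelled as name -> CQL type.
    static List<String> alterTableUpdates(Map<String, String> oldColumns,
                                          Map<String, String> newColumns) {
        List<String> updates = new ArrayList<>();
        updates.add("tables:rewrite"); // addTableToSchemaMutation(newTable, false) + deletePrevious()
        for (Map.Entry<String, String> e : newColumns.entrySet()) {
            if (!Objects.equals(oldColumns.get(e.getKey()), e.getValue())) {
                updates.add("columns:write:" + e.getKey());
                // Behaviour (1): a delete marker, since no column in this model is masked.
                updates.add("column_masks:delete:" + e.getKey());
            }
        }
        return updates;
    }

    public static void main(String[] args) {
        Map<String, String> v = new LinkedHashMap<>();
        v.put("cs", "text");
        v.put("brvqnt", "int");
        v.put("qzctg", "set<int>");

        // ALTER TABLE ks.v WITH speculative_retry = 'ALWAYS': empty column diff.
        System.out.println(alterTableUpdates(v, v)); // [tables:rewrite]

        // ALTER TABLE ks.v ADD ck set<int>: one new column, so column_masks appears.
        Map<String, String> withCk = new LinkedHashMap<>(v);
        withCk.put("ck", "set<int>");
        System.out.println(alterTableUpdates(v, withCk));
        // [tables:rewrite, columns:write:ck, column_masks:delete:ck]
    }
}
{code}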

The 4.1 receiver ({{DefaultSchemaUpdateHandler.applyMutations}}) first writes
received mutations into the local {{system_schema}} and only afterwards
re-reads the keyspace. The re-read ({{fetchColumns}}, SchemaKeyspace.java:1076
in 5.0 / :987 in 4.1) throws {{MissingColumns}} when the table has zero column
rows, but by then the invalid state is already persisted on disk. On a running
node the throw only aborts the in-memory update. At startup
({{CassandraDaemon.setup -> Schema.loadFromDisk -> ... -> fetchColumns}}) it is
fatal, because {{fetchTables}} rethrows unless
{{cassandra.ignore_corrupted_schema_tables=true}} is set.
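
The ordering problem can be shown with an in-memory stand-in for the local {{system_schema}}. {{ApplyMutationsSketch}} is hypothetical; only the exception message is taken from the log in this report.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the receiver ordering described above: apply to the local
// system_schema first, re-read afterwards. Not the real DefaultSchemaUpdateHandler.
final class ApplyMutationsSketch {

    static final class MissingColumns extends RuntimeException {
        MissingColumns(String table) {
            super("Columns not found in schema table for " + table);
        }
    }

    // "On-disk" schema: table rows, and column rows per table.
    final Set<String> tableRows = new HashSet<>();
    final Map<String, Set<String>> columnRows = new HashMap<>();

    void applyMutations(String table, Set<String> columns) {
        // 1. Persist first: the table row and whatever column rows the push carried.
        tableRows.add(table);
        columnRows.computeIfAbsent(table, t -> new HashSet<>()).addAll(columns);
        // 2. Re-read afterwards; a throw here does not undo step 1.
        reload();
    }

    // Stand-in for fetchTables -> fetchTable -> fetchColumns (also run at startup).
    void reload() {
        for (String table : tableRows) {
            if (columnRows.getOrDefault(table, Set.of()).isEmpty()) {
                throw new MissingColumns(table);
            }
        }
    }

    public static void main(String[] args) {
        ApplyMutationsSketch node2 = new ApplyMutationsSketch();
        try {
            node2.applyMutations("ks.v", Set.of()); // bare table row from an option-only ALTER
        } catch (MissingColumns e) {
            System.out.println("running node: " + e.getMessage());
        }
        System.out.println(node2.tableRows.contains("ks.v")); // true: still persisted
        try {
            node2.reload(); // simulated restart: Schema.loadFromDisk
        } catch (MissingColumns e) {
            System.out.println("startup: " + e.getMessage());
        }
    }
}
{code}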
h3. Net effect - why the upgrade is stuck every time

The poisoned node is exactly the node the rolling upgrade restarts next, and
both repair channels (column-bearing pushes and cross-major pull) are sealed
before that restart. The corrupt on-disk state is therefore stable until the
upgrade detonates it. No races or faults are involved: the sequence CREATE
TABLE (via the new node) followed by any column-free table-row rewrite
reaching the old node produces the brick deterministically. The observed
evidence matches this exactly: the old node ends with the "Columns not found"
form of {{MissingColumns}} (zero column rows; even the column from the later
{{ALTER ... ADD}} is absent, because that push was dropped too).
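
Putting (1)-(3) together, a small hypothetical simulation of node 2 reproduces the end state from the three DDL statements in the reproduction steps, assuming each push can be reduced to its target table, its column rows, and whether it carries {{column_masks}} content. {{MixedWindowSimulation}} and {{Push}} are illustrative names.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical end-to-end model of the old (4.1.10) node during the mixed window,
// combining behaviours (1)-(3). Pulls are never attempted, per behaviour (2).
final class MixedWindowSimulation {

    // A pushed schema change reduced to: target table, column rows, and whether it
    // carries column_masks content (which 4.1.10 cannot decode).
    static final class Push {
        final String table;
        final Set<String> columns;
        final boolean hasColumnMasks;

        Push(String table, Set<String> columns, boolean hasColumnMasks) {
            this.table = table;
            this.columns = columns;
            this.hasColumnMasks = hasColumnMasks;
        }
    }

    final Map<String, Set<String>> onDisk = new HashMap<>(); // table -> column rows

    // Returns true if the push was persisted, false if it was dropped on decode.
    boolean receive(Push p) {
        if (p.hasColumnMasks) {
            return false; // (1) UnknownTableException: the whole push is lost
        }
        onDisk.computeIfAbsent(p.table, t -> new TreeSet<>()).addAll(p.columns); // (3) persisted
        return true;
    }

    // What Schema.loadFromDisk would reject at the next startup.
    List<String> tablesWithoutColumns() {
        List<String> bad = new ArrayList<>();
        onDisk.forEach((table, cols) -> { if (cols.isEmpty()) bad.add(table); });
        return bad;
    }

    public static void main(String[] args) {
        MixedWindowSimulation node2 = new MixedWindowSimulation();
        List<Push> pushes = List.of(
                new Push("ks.v", Set.of("cs", "brvqnt", "qzctg"), true), // CREATE TABLE
                new Push("ks.v", Set.of(), false),                       // option-only ALTER
                new Push("ks.v", Set.of("ck"), true));                   // ALTER ... ADD ck
        for (Push p : pushes) {
            System.out.println(node2.receive(p)); // false, true, false
        }
        System.out.println(node2.tablesWithoutColumns()); // [ks.v]
    }
}
{code}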
h2. Steps to Reproduce

Setup: two nodes on 4.1.10; 5.0.6 binaries configured with
{{storage_compatibility_mode: CASSANDRA_4}}.
{code:sql}
-- step 1: both nodes on 4.1.10
CREATE KEYSPACE ks WITH replication = 
{'class':'SimpleStrategy','replication_factor':2};
{code}
 # Rolling-upgrade node 1 to 5.0.6 (drain, stop, swap binaries, start). Cluster 
is now mixed 5.0/4.1.
 # Through *node 1* (the 5.0.6 node):
{code:sql}
CREATE TABLE ks.v (cs text, brvqnt int, qzctg set<int>, PRIMARY KEY (brvqnt)) 
WITH speculative_retry = '90MS';
-- option-only ALTER: the only push the 4.1 node can decode
ALTER TABLE ks.v WITH speculative_retry = 'ALWAYS';
ALTER TABLE ks.v ADD ck set<int>;
{code}

 # Observe the node 2 (4.1.10) log:
{{ERROR [MigrationStage:1] ... SchemaKeyspace$MissingColumns: Columns not found in schema table for ks.v}}
(stack: {{fetchColumns -> fetchTable -> fetchTables}}). Node 2 keeps running.
{{SELECT * FROM system_schema.columns WHERE keyspace_name='ks' AND table_name='v'}}
on node 2 returns no rows, while {{system_schema.tables}} has the row.
 # Rolling-upgrade node 2 to 5.0.6.

Expected: node 2 starts on 5.0.6 with the full schema (or the DDL issued
during the mixed-version window is rejected or propagated safely).
Actual: node 2 crash-loops:
{noformat}
ERROR [main] CassandraDaemon.java:287 - Error while loading schema:
org.apache.cassandra.schema.SchemaKeyspace$MissingColumns: Columns not found in schema table for ks.v
    at org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:1081)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:1032)
    at org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:991)
    ...
    at org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:155)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:283)
ERROR [main] CassandraDaemon.java:887 - Exception encountered during startup
{noformat}
The daemon exits and repeats identically on every restart (~9 s cycle 
observed). The rolling upgrade cannot proceed.

We reproduced the same outcome with other follow-up statements
({{CREATE INDEX}}; {{CREATE}} immediately followed by a parameter-only
{{ALTER}}). The only requirement is that a column-free table-row rewrite
reaches the old node after the column-bearing CREATE push was dropped. We can
provide recorded deterministic sequences and full per-lane logs for three
independent reproductions.
h2. Root cause code pointers
 * {{SchemaKeyspace.addColumnToSchemaMutation}} (5.0.6 SchemaKeyspace.java:712) 
- unconditionally updates {{ColumnMasks}} for every user-table column 
(delete-marker when {{mask == null}}), making every column-bearing schema
push undecodable by 4.1.
 * {{SchemaKeyspace.addAlterTableToSchemaMutation}} (5.0.6 
SchemaKeyspace.java:607) - {{addTableToSchemaMutation(newTable, false)}} 
rewrites the table row with {{deletePrevious()}} but serializes only changed 
columns; option-only ALTERs produce a bare, decodable table row.
 * {{MigrationCoordinator.shouldPushSchemaTo}} (:769/:775) - 
messaging-version-equality only; passes under {{CASSANDRA_4}} compatibility 
mode ({{MessagingService.java:257}}).
 * {{MigrationCoordinator.shouldPullFromEndpoint}} (:353) - cross-major pull 
refusal (CASSANDRA-13274); removes the repair path.
 * {{DefaultSchemaUpdateHandler.applyMutations}} (4.1.10 :201-214, 5.0.6 
:273-288) - persists received mutations before the re-read that detects the 
violation.
 * {{SchemaKeyspace.fetchColumns}} (5.0.6 :1076, throws :1081/:1087) and 
{{fetchTables}} (rethrows unless {{ignore_corrupted_schema_tables}} is set) - fatal
at startup via {{Schema.loadFromDisk}} ({{CassandraDaemon.setup}}).

h2. Suggested fixes

Any one layer breaks the chain; (1) is the most targeted:
 # *Stop poisoning pushes:* omit the {{column_masks}} delete-marker when a 
column has no mask, or version-gate 5.0-only schema-table content out of pushes 
to pre-5.0 peers (the {{isReplicatedSystemKeyspace}} branch shows the pattern 
already exists). With decodable CREATE pushes, the old node receives complete 
schema and this scenario disappears.
 # *Make table-alter mutations safe for receivers without the base schema:* 
include primary-key column rows whenever the table row is rewritten, or have 
the receiver refuse to apply an alter for a table with no local column rows 
(triggering a repair instead).
 # *Validate before persisting:* {{applyMutations}} should check the post-state 
(every table row has columns incl. a partition key) before flushing to local 
{{system_schema}}, rather than persisting a state that is fatal at the next
boot.
 # *Repair at startup:* when {{loadFromDisk}} encounters table-without-columns 
and peers exist, attempt a schema pull/repair before treating it as 
unrecoverable corruption (today's options - ignore-flag or manual DELETEs from 
{{system_schema}} - are operator surgery on a mid-upgrade cluster).
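
Fixes (1) and (3) can be sketched in the same simplified model. Both methods are hypothetical illustrations of the suggestions above, not patches against the real {{SchemaKeyspace}} or {{DefaultSchemaUpdateHandler}}.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketches of suggested fixes (1) and (3); names and data shapes are
// illustrative and do not correspond to real Cassandra APIs.
final class SuggestedFixesSketch {

    // Fix (1), first variant: touch column_masks only when the column has a mask,
    // so unmasked columns produce pushes that 4.1 can decode. (Assumption: removing
    // an existing mask would still need a delete marker; that case is not modelled.)
    static void addColumn(List<String> updates, String column, String maskOrNull) {
        updates.add("columns:write:" + column);
        if (maskOrNull != null) {
            updates.add("column_masks:write:" + column);
        }
    }

    // Fix (3): build the post-state in memory and validate it before anything is
    // persisted. Simplified check: every table has at least one column (the report
    // also asks for a partition-key column, which this model does not track).
    static Map<String, Set<String>> applyValidated(Map<String, Set<String>> onDisk,
                                                   Map<String, Set<String>> incoming) {
        Map<String, Set<String>> post = new HashMap<>();
        onDisk.forEach((table, cols) -> post.put(table, new HashSet<>(cols)));
        incoming.forEach((table, cols) ->
                post.computeIfAbsent(table, t -> new HashSet<>()).addAll(cols));
        for (Map.Entry<String, Set<String>> e : post.entrySet()) {
            if (e.getValue().isEmpty()) {
                throw new IllegalStateException("refusing to persist " + e.getKey() + " without columns");
            }
        }
        return post; // only now would the caller flush this state to system_schema
    }

    public static void main(String[] args) {
        List<String> updates = new ArrayList<>();
        addColumn(updates, "cs", null);
        System.out.println(updates); // [columns:write:cs]

        Map<String, Set<String>> onDisk = new HashMap<>(); // old node never saw the CREATE
        try {
            applyValidated(onDisk, Map.of("ks.v", Set.of())); // bare table row
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + "; on-disk state unchanged: " + onDisk.isEmpty());
        }
    }
}
{code}

Because validation runs against a copy, a rejected push leaves the on-disk schema untouched; deciding how to then trigger a repair is the part fix (2) addresses.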

h2. Additional context
 * The same end state (table row without columns -> fatal restart) can also be 
reached without any version skew when a node misses a CREATE during a network 
partition and later receives a diff-only ALTER push; we focus this report on 
the rolling-upgrade path because it is deterministic and blocks the documented 
upgrade procedure.
 * Related: CASSANDRA-13274 (cross-major schema exchange restriction) is what
removes the pull-based repair path. If preferred, we will file the
independently observed {{column_masks}} push-deserialization issue (schema
change lost, {{UnknownTableException}} for table id
{{738cc5ed-0168-3268-b9d1-853d4bc278af}} on the 4.1 peer) as a separate
report; it is ingredient (1) of this one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
