Hello, I want to discuss the possibilities of fixing CASSANDRA-18591 and CASSANDRA-18589 (~exceptions during reads), considering that TCM will become a reality soon. While both issues can be hit even on a single-node cluster, I think it's important for the solution to be at least TCM-friendly and easily extendable in the future.
"Writes that occurred post-column-drop" are at the core of both issues. For the purpose of this thread, I define such "post-drop writes" via the DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion functions: post-drop writes are writes whose timestamp exceeds the corresponding column's drop time.

The problem manifestation: trying to SELECT rows containing such writes (for instance, SELECT * ...) fails with an AssertionError or NPE. Furthermore, the problem is persistent: it is not remedied by compaction, and it does not require schema disagreement to occur.

Bug details:

Post-drop writes are created in two ways:
1. A race condition between two nodes, where a write is processed by a node that is unaware of the schema change and thus acquires a timestamp larger than the drop time.
2. A race condition within a single node, where writes only verify the column's existence; there is no synchronization between column drops and writes. Hence, even on a single-node cluster, we can encounter post-drop writes.

I assume such writes are unavoidable now (and possibly also with TCM, at least in its first incarnation).

These post-drop writes cause issues during reads, because different parts of the read path treat them differently. Importantly, these discrepancies occur without any schema disagreement or similar race conditions:

1. ColumnFilter
ColumnFilter answers, i.a., the question of whether a specific column should be fetched. In the presence of dropped columns, the answer differs depending on how the question is asked. Specifically, there is a discrepancy between ColumnFilter::fetches() and ColumnFilter::fetchedColumns(): fetches(someDroppedColumn) may return true, even though fetchedColumns() does not contain that column.
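To make the discrepancy concrete, here is a toy model — NOT the real ColumnFilter; SimpleColumnFilter, liveColumns, and fetchAll are made-up names for illustration. The idea: a "fetch all" style filter answers fetches() with a blanket true without consulting the schema, while fetchedColumns() is derived from the live schema only, so a dropped column passes one check but not the other.

```java
import java.util.Set;

// Toy model of the discrepancy (not the real Cassandra ColumnFilter).
class SimpleColumnFilter
{
    private final Set<String> liveColumns;  // columns currently in the schema
    private final boolean fetchAll;         // blanket-true shortcut

    SimpleColumnFilter(Set<String> liveColumns, boolean fetchAll)
    {
        this.liveColumns = liveColumns;
        this.fetchAll = fetchAll;
    }

    // Analogue of ColumnFilter::fetches — may answer true without looking
    // at the schema, which is how dropped-column data gets let in.
    boolean fetches(String column)
    {
        return fetchAll || liveColumns.contains(column);
    }

    // Analogue of ColumnFilter::fetchedColumns — built from the live schema
    // only, so a dropped column is never present here.
    Set<String> fetchedColumns()
    {
        return liveColumns;
    }
}
```

With fetchAll == true, fetches("dropped_c") answers true while fetchedColumns() does not contain "dropped_c" — the two APIs disagree about the very same column.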
While surprising, counter-intuitive, and error-prone, this is not immediately problematic as far as correctness is concerned: currently, it is `ColumnFilter::fetches` that is used to decide whether data for a particular column should be skipped during deserialization. The reason `ColumnFilter::fetches` may return true for every invocation, regardless of the actual column, is performance. So, for the time being, let's assume that a ColumnFilter is allowed to let in data of dropped columns, and that such behavior should be properly documented etc. etc.

2. UnfilteredSerializer::read[Complex|Simple]Column + DeserializationHelper
DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion help UnfilteredSerializer skip data that was written *before* the column drop time. This implies that writes to dropped columns are expected at this point (or does it?). At least it explains why pre-drop writes do not cause issues, even though ColumnFilter lets them in.

3. Row::Merger::ColumnDataReducer
ColumnDataReducer has an optimization: if the schema contains no complex columns, the complexBuilder is not constructed (== null). However, since we can read a write to a dropped complex column, we eventually hit an NPE in getReduced when we attempt to use the complexBuilder. While this NPE (CASSANDRA-18589) is easy to fix, it suggests that we may already be assuming no post-drop writes at this point in the code.

4. Unfiltered::serializeRowBody
serializeRowBody explicitly asserts that the SerializationHelper is aware of the column we are about to serialize. This is not true for writes to dropped columns, causing CASSANDRA-18591.

There might be other places in the read path that I haven't identified which also make different assumptions about post-drop writes. So, how do we resolve these issues?
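As a recap before discussing solutions: the pre-drop/post-drop distinction from point 2 boils down to a single timestamp comparison. The sketch below is a hypothetical simplification (DropTimeCheck and the method names are made up; the real logic lives in DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion).

```java
// Hypothetical simplification of the pre-drop vs. post-drop distinction.
class DropTimeCheck
{
    // writeTimestamp: the cell's write timestamp
    // dropTime: the timestamp recorded when the column was dropped
    static boolean isPreDropWrite(long writeTimestamp, long dropTime)
    {
        // Pre-drop writes (timestamp <= drop time) are recognized as
        // belonging to the dropped column and skipped during
        // deserialization, so they cause no trouble downstream.
        return writeTimestamp <= dropTime;
    }

    // A post-drop write — e.g. one stamped by a node racing with the drop —
    // fails the check above and flows into code (ColumnDataReducer,
    // serializeRowBody) that no longer expects the column to exist.
    static boolean isPostDropWrite(long writeTimestamp, long dropTime)
    {
        return writeTimestamp > dropTime;
    }
}
```

Any write whose timestamp lands past the drop time, however it got there, slips through this filter and reaches code paths with contradictory expectations.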
The most straightforward solution, which I do not discount, is to introduce and document a system limitation: abstain from dropping columns (or making schema changes) during writes. But if we wish to tackle this issue programmatically, we need to consider the various scenarios possible during a read:

1. The coordinator and replica share the same schema.
2. The coordinator is aware of the dropped column, but the replica is not.
3. The replica is aware of the dropped column, but the coordinator is not.
4. The coordinator is aware of the dropped column and its subsequent re-addition (both `columns` and `droppedColumns` in `TableMetadata` contain a column with the same name).
5. The coordinator is aware of the dropped column and its subsequent re-addition, but the replica only knows about the drop.
6. The replica is aware of the dropped column and its subsequent re-addition, but the coordinator is not.
7. The replica is aware of the dropped column and its subsequent re-addition, but the coordinator is only aware of the drop.

It is unclear to me what should happen in each case, especially considering that we want (I assume) to avoid resurrecting writes. I have several ideas, but first I'd like to confirm that my thinking about the problem resonates with you. Does it?

Finally, let's consider the role of Transactional Cluster Metadata (TCM). With TCM, distinguishing and handling some of the aforementioned scenarios may become easier, or even feasible at all. However, my familiarity with TCM internals and future plans is limited, so I would greatly appreciate any insights into the relationship between TCM and scenarios such as those described above.

--
Jakub Zytka
e. jakub.zy...@datastax.com
w. www.datastax.com