Hello, I want to discuss the possibilities of fixing CASSANDRA-18591 and CASSANDRA-18589 (~exceptions during reads), considering that TCM will become a reality soon. While both issues can be hit even on a single-node cluster, I think it's important for the solution to be at least TCM-friendly and easily extendable in the future.
"Writes that occurred post-column-drop" are at the core of both issues. For the purpose of this thread, I define such "post-drop writes" via the DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion functions: post-drop writes are writes whose timestamp exceeds the corresponding column's drop time.

The problem manifestation: trying to SELECT rows containing such writes (for instance, SELECT * ...) fails with an AssertionError or NPE. Furthermore, the problem is persistent: it is not remedied by compaction, and it does not require schema disagreement to occur.

Bug details:

Post-drop writes are created in two ways:
1. A race condition between two nodes, where a write is processed by a node that is unaware of the schema change and thus acquires a timestamp larger than the drop time.
2. A race condition within a single node, where writes only verify the column's existence; there is no synchronization between column drops and writes. Hence, even on a single-node cluster, we can encounter post-drop writes.

I assume such writes are unavoidable now (and possibly also with TCM, at least in its first incarnation).

These post-drop writes cause issues during reads, because different parts of the read path treat them differently. Importantly, these discrepancies occur without any schema disagreement or similar race conditions:

1. ColumnFilter
ColumnFilter answers, i.a., the question of whether a specific column should be fetched. In the presence of dropped columns, the answer differs depending on how the question is asked. Specifically, there is a discrepancy between ColumnFilter::fetches() and ColumnFilter::fetchedColumns(): fetches(someDroppedColumn) may return true, even though fetchedColumns() does not contain that column.
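To make the discrepancy concrete, here is a toy model — NOT the real ColumnFilter; SimpleColumnFilter, liveColumns, and fetchAll are made-up names for illustration. The idea: a "fetch all" style filter answers fetches() with a blanket true without consulting the schema, while fetchedColumns() is derived from the live schema only, so a dropped column passes one check but not the other.

```java
import java.util.Set;

// Toy model of the discrepancy (not the real Cassandra ColumnFilter).
class SimpleColumnFilter
{
    private final Set<String> liveColumns;  // columns currently in the schema
    private final boolean fetchAll;         // blanket-true shortcut

    SimpleColumnFilter(Set<String> liveColumns, boolean fetchAll)
    {
        this.liveColumns = liveColumns;
        this.fetchAll = fetchAll;
    }

    // Analogue of ColumnFilter::fetches — may answer true without looking
    // at the schema, which is how dropped-column data gets let in.
    boolean fetches(String column)
    {
        return fetchAll || liveColumns.contains(column);
    }

    // Analogue of ColumnFilter::fetchedColumns — built from the live schema
    // only, so a dropped column is never present here.
    Set<String> fetchedColumns()
    {
        return liveColumns;
    }
}
```

With fetchAll == true, fetches("dropped_c") answers true while fetchedColumns() does not contain "dropped_c" — the two APIs disagree about the very same column.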
While surprising, counter-intuitive, and error-prone, this is not immediately problematic as far as correctness is concerned: currently, it is `ColumnFilter::fetches` that is used to decide whether data for a particular column should be skipped during deserialization. The reason `ColumnFilter::fetches` may return true for every invocation, regardless of the actual column, is performance. So, for the time being, let's assume that a ColumnFilter is allowed to let in data of dropped columns, and that such behavior should be properly documented etc. etc.

2. UnfilteredSerializer::read[Complex|Simple]Column + DeserializationHelper
DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion help UnfilteredSerializer skip data that was written *before* the column drop time. This implies that writes to dropped columns are expected at this point (or does it?). At least it explains why pre-drop writes do not cause issues, even though ColumnFilter lets them in.

3. Row::Merger::ColumnDataReducer
ColumnDataReducer has an optimization: if the schema contains no complex columns, the complexBuilder is not constructed (== null). However, since we can read a write to a dropped complex column, we eventually hit an NPE in getReduced when we attempt to use the complexBuilder. While this NPE (CASSANDRA-18589) is easy to fix, it suggests that we may already be assuming no post-drop writes at this point in the code.

4. Unfiltered::serializeRowBody
serializeRowBody explicitly asserts that the SerializationHelper is aware of the column we are about to serialize. This is not true for writes to dropped columns, causing CASSANDRA-18591.

There might be other places in the read path that I haven't identified which also make different assumptions about post-drop writes. So, how do we resolve these issues?
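As a recap before discussing solutions: the pre-drop/post-drop distinction from point 2 boils down to a single timestamp comparison. The sketch below is a hypothetical simplification (DropTimeCheck and the method names are made up; the real logic lives in DeserializationHelper::isDropped and DeserializationHelper::isDroppedComplexDeletion).

```java
// Hypothetical simplification of the pre-drop vs. post-drop distinction.
class DropTimeCheck
{
    // writeTimestamp: the cell's write timestamp
    // dropTime: the timestamp recorded when the column was dropped
    static boolean isPreDropWrite(long writeTimestamp, long dropTime)
    {
        // Pre-drop writes (timestamp <= drop time) are recognized as
        // belonging to the dropped column and skipped during
        // deserialization, so they cause no trouble downstream.
        return writeTimestamp <= dropTime;
    }

    // A post-drop write — e.g. one stamped by a node racing with the drop —
    // fails the check above and flows into code (ColumnDataReducer,
    // serializeRowBody) that no longer expects the column to exist.
    static boolean isPostDropWrite(long writeTimestamp, long dropTime)
    {
        return writeTimestamp > dropTime;
    }
}
```

Any write whose timestamp lands past the drop time, however it got there, slips through this filter and reaches code paths with contradictory expectations.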
The most straightforward solution, which I do not discount, is to introduce and document a system limitation: abstain from dropping columns (or making schema changes) during writes. But if we wish to tackle this issue programmatically, we need to consider the various scenarios possible during a read:

1. The coordinator and replica share the same schema.
2. The coordinator is aware of the dropped column, but the replica is not.
3. The replica is aware of the dropped column, but the coordinator is not.
4. The coordinator is aware of the dropped column and its subsequent re-addition (both `columns` and `droppedColumns` in `TableMetadata` contain a column with the same name).
5. The coordinator is aware of the dropped column and its subsequent re-addition, but the replica only knows about the drop.
6. The replica is aware of the dropped column and its subsequent re-addition, but the coordinator is not.
7. The replica is aware of the dropped column and its subsequent re-addition, but the coordinator is only aware of the drop.

It is unclear to me what should happen in each case, especially considering that we want (I assume) to avoid resurrecting writes. I have several ideas, but first I'd like to confirm that my thinking about the problem resonates with you. Does it?

Finally, let's consider the role of Transactional Cluster Metadata (TCM). With TCM, distinguishing and handling some of the aforementioned scenarios may become easier, or even feasible at all. However, my familiarity with TCM internals and future plans is limited, so I would greatly appreciate any insights into the relationship between TCM and scenarios such as those described above.

--
Jakub Zytka
e. jakub.zy...@datastax.com
w. www.datastax.com