[
https://issues.apache.org/jira/browse/CASSANDRA-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072103#comment-18072103
]
Andrés Beck-Ruiz commented on CASSANDRA-21216:
----------------------------------------------
After further investigation, it appears that this bug surfaces when internode
deserialization of a large {{READ_REQ}} message fails. The failure throws [this
"Unknown column"
exception|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L489],
failing the original request. When this happens, the {{savedBuffer}} and
{{savedNextKey}} fields of the {{FastBuilder}} are not cleaned up and still
hold stale {{ColumnMetadata}} objects. If the thread that failed to deserialize
the message later reuses the same {{FastBuilder}} during a mutation, those
stale {{ColumnMetadata}} objects can corrupt an existing BTree of {{Row}}
objects, causing a {{ClassCastException}} when the partition is subsequently
read or written.
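To make the suspected failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above. All class, field, and string names are illustrative stand-ins, not the actual {{BTree.FastBuilder}} code: a pooled builder aborts mid-build with an exception, its saved state is not cleared on reset, and the stale elements leak into the next build on the same thread.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a per-thread recycled builder (hypothetical names).
class RecycledBuilder {
    // State that survives between uses because the builder is pooled per thread.
    final List<Object> savedBuffer = new ArrayList<>();
    Object savedNextKey;

    // Simulates a deserialization failure mid-build: the exception propagates,
    // but savedBuffer/savedNextKey are left populated.
    void failingBuild() {
        savedBuffer.add("stale-ColumnMetadata");
        savedNextKey = "stale-key";
        throw new RuntimeException("Unknown column");
    }

    // Incomplete reset, analogous to a reset() that does not clear saved state.
    void reset() {
        // savedBuffer and savedNextKey intentionally NOT cleared here
    }

    // Next use on the same thread: stale elements are flushed into the result.
    List<Object> build(Object fresh) {
        List<Object> result = new ArrayList<>(savedBuffer);
        if (savedNextKey != null)
            result.add(savedNextKey);
        result.add(fresh);
        return result;
    }
}

public class StaleBuilderDemo {
    public static void main(String[] args) {
        RecycledBuilder b = new RecycledBuilder();
        try { b.failingBuild(); } catch (RuntimeException ignored) {}
        b.reset(); // builder returns to the pool with stale state intact
        // The stale objects leak into the next build on this thread:
        System.out.println(b.build("fresh-Row"));
        // prints [stale-ColumnMetadata, stale-key, fresh-Row]
    }
}
```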
I was able to verify this theory by using the reproduction method described on
this ticket with added instrumentation. First, I added a check inside the
[addAllWithSizeDeltaInternal
function|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java#L148],
which is called on the write path, to log an error whenever a BTree of
{{Row}} objects contained {{ColumnMetadata}} objects. The number of
{{ColumnMetadata}} objects in the corrupt tree was 32 every time. This led to
the theory that the {{savedBuffer}}, which holds 31 elements, and
{{savedNextKey}}, which holds 1, were somehow being emptied into the {{Row}}
BTree. I then added instrumentation to log an error whenever a retrieved
{{FastBuilder}} had a non-null {{savedBuffer}} or {{savedNextKey}}. In the
logs I saw “Unknown column” exceptions during deserialization first, then the
{{FastBuilder}} error logs I added, produced by the same thread that threw the
deserialization exceptions. Milliseconds later, these were followed by
{{ClassCastException}}s failing reads and writes. I observed this pattern
across several bug reproduction runs.
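The second piece of instrumentation can be sketched as follows. This is a hypothetical simplification (the field types and the retrieval hook are stand-ins for the real {{FastBuilder}} pooling code), showing only the null-check-and-log idea:

```java
import java.util.Arrays;

// Hypothetical sketch of the pool-retrieval check described above; the real
// fields live on BTree.FastBuilder and are checked where a builder is
// retrieved for reuse.
public class BuilderPoolCheck {
    // Stand-ins for the saved state fields on a recycled builder.
    static Object[] savedBuffer;
    static Object savedNextKey;

    // Returns true if the previous user left stale state behind.
    static boolean hasStaleState() {
        return savedBuffer != null || savedNextKey != null;
    }

    public static void main(String[] args) {
        // Simulate a builder retrieved after an aborted build.
        savedBuffer = new Object[] { "stale ColumnMetadata" };
        savedNextKey = "stale key";
        if (hasStaleState())
            System.out.println("ERROR: FastBuilder retrieved with stale state: "
                               + Arrays.toString(savedBuffer)
                               + " / " + savedNextKey);
        // prints ERROR: FastBuilder retrieved with stale state: [stale ColumnMetadata] / stale key
    }
}
```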
A possible fix, which I’ve added
[here|https://github.com/andresbeckruiz/cassandra/commit/14dbac67bee3917ce71cd18dc48ef19f5f0cf649]
and verified builds successfully on Cassandra 4.1, is to clear
{{savedBuffer}} and {{savedNextKey}} in the {{FastBuilder.reset}} function, as
is done for the buffer in
[AbstractUpdater|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3371-L3372].
I also noticed that {{AbstractUpdater}} does not set {{savedNextKey}} to null,
so that could be worth adding as well. I’ve verified with three reproduction
test runs that this prevents the bug from resurfacing, even after seeing over
100 “Unknown column” exceptions during internode large-message
deserialization.
I will post a discussion thread shortly with more details.
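The shape of the proposed fix can be sketched like this. This is an assumption-laden simplification based on my description above, not the exact patch in the linked commit; the field names mirror {{FastBuilder}} but the class is a stand-in:

```java
import java.util.Arrays;

// Sketch of the proposed reset() change: null out every retained slot so a
// builder recycled after an aborted build cannot leak stale ColumnMetadata.
public class ResetFixSketch {
    // Simulate saved state left behind by a failed deserialization.
    static Object[] savedBuffer = new Object[] { "stale-1", "stale-2" };
    static Object savedNextKey = "stale-key";

    // Proposed reset: clear the buffer slots (as AbstractUpdater does for its
    // buffer) and also drop savedNextKey, which AbstractUpdater omits.
    static void reset() {
        if (savedBuffer != null)
            Arrays.fill(savedBuffer, null);
        savedNextKey = null;
    }

    public static void main(String[] args) {
        reset();
        System.out.println(Arrays.toString(savedBuffer) + " / " + savedNextKey);
        // prints [null, null] / null
    }
}
```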
> ClassCastException thrown on read and write paths after schema modification
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-21216
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21216
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Isaac Reath
> Assignee: Andrés Beck-Ruiz
> Priority: Normal
> Fix For: 4.1.x
>
> Attachments: ClassCastException-Read.txt,
> ClassCastException-Writes-1.txt
>
>
> After a schema modification on a cluster serving reads and writes, we have
> noticed that it is possible to see ClassCastException being thrown when
> comparing clustering keys in a few codepaths (see attached stacktraces).
> After a schema modification, we see regular ClassCastExceptions happening on
> both the read and write paths.
> The table in question is very wide (~4200 columns); the issue originally
> occurred on a 30-node cluster running 4.1.3.
> I've been able to reproduce it on a 3 node cluster running 4.1.10 with a
> single data center that is doing ~60 req/sec and frequent concurrent schema
> modifications. It appears to be a race as it doesn't happen on every schema
> change, but will occur at a rate of roughly 1/200 schema changes while the
> read/write workload is ongoing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)