[ 
https://issues.apache.org/jira/browse/CASSANDRA-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072103#comment-18072103
 ] 

Andrés Beck-Ruiz commented on CASSANDRA-21216:
----------------------------------------------

After further investigation, it appears that this bug surfaces when internode 
message deserialization of a large {{READ_REQ}} fails. The failure throws [this 
"Unknown column" 
exception|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L489],
 causing the original request to fail. When this happens, {{savedBuffer}} and 
{{savedNextKey}} objects in the {{FastBuilder}} are not cleaned up and contain 
stale {{ColumnMetadata}} objects. If the same thread that fails to deserialize 
the message then picks up the same {{FastBuilder}} during a mutation, it can 
corrupt an existing BTree of {{Row}} objects with stale {{ColumnMetadata}} 
objects, causing a {{ClassCastException}} when the partition is read from or 
written to.
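The failure mode can be illustrated with a simplified model (hypothetical classes, not Cassandra's actual {{FastBuilder}}): a thread-recycled builder whose saved state is not cleared when a build throws part-way can leak the failed build's objects into the next build performed on the same instance.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a thread-recycled builder; the real
// FastBuilder parks elements in savedBuffer/savedNextKey while building.
class RecycledBuilder {
    // Stand-in for savedBuffer/savedNextKey: state that survives between uses.
    private final List<Object> saved = new ArrayList<>();

    // Build a result from input; a deserialization failure throws part-way
    // through, after elements have already been parked in `saved`.
    List<Object> build(List<?> input, boolean failMidway) {
        saved.addAll(input);
        if (failMidway)
            throw new RuntimeException("Unknown column"); // `saved` is NOT cleared
        List<Object> result = new ArrayList<>(saved);
        saved.clear(); // happy path cleans up
        return result;
    }

    int staleCount() { return saved.size(); }
}
```

A build that fails leaves its elements parked in the recycled instance; the next build silently prepends them to its own result, which is the analogue of stale {{ColumnMetadata}} objects ending up inside a {{Row}} BTree.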

I was able to verify this theory by using the reproduction method described on 
this ticket with added instrumentation. First, I added a function to log an 
error when a BTree of {{Row}} objects contained {{ColumnMetadata}} objects 
within the [addAllWithSizeDeltaInternal 
function|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java#L148],
 which is called on the write path. The number of {{ColumnMetadata}} objects 
in the corrupt tree was 32 every time. This led to the theory that the 
{{savedBuffer}}, which holds 31 elements, and {{savedNextKey}}, which holds 1 
element, were somehow being emptied into the {{Row}} BTree. I then added 
instrumentation to log an error whenever the {{savedBuffer}} and 
{{savedNextKey}} objects were not null for a retrieved {{FastBuilder}}. In 
the logs, “Unknown column” exceptions during deserialization appeared first, 
followed by the {{FastBuilder}} error logs I added, produced by the same 
thread that threw the deserialization exceptions. Milliseconds later came the 
{{ClassCastException}}s that failed reads and writes. I observed this pattern 
across several bug reproduction runs. 
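
The write-path check amounts to scanning a node's element array for objects of the wrong type. As a rough sketch of that kind of instrumentation (a hypothetical helper, not the actual patch):

```java
final class TreeAudit {
    // Hypothetical sketch of the instrumentation described above: walk a
    // flat array of tree-node elements and count entries of an unexpected
    // type, so a corrupt node (e.g. Row leaves polluted with
    // ColumnMetadata) can be detected and logged. A real check would also
    // recurse into child nodes of a branch.
    static int countUnexpected(Object[] nodeElements, Class<?> unexpectedType) {
        int count = 0;
        for (Object element : nodeElements)
            if (unexpectedType.isInstance(element))
                count++;
        return count;
    }
}
```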

A possible fix, which I’ve pushed 
[here|https://github.com/andresbeckruiz/cassandra/commit/14dbac67bee3917ce71cd18dc48ef19f5f0cf649]
 and verified to build successfully on Cassandra 4.1, is to clear 
{{savedBuffer}} and {{savedNextKey}} in the {{FastBuilder.reset}} function, as 
is done for the buffer in 
[AbstractUpdater|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3371-L3372].
 I also noticed that {{AbstractUpdater}} does not set {{savedNextKey}} to 
null, so that could be worth adding there as well. I’ve verified across three 
reproduction test runs that this prevents the bug from resurfacing, even after 
seeing over 100 “Unknown column” exceptions during internode large message 
deserialization.
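
The shape of the proposed fix, again as a hypothetical sketch rather than the actual patch: have the builder's reset path drop any saved state, so a build that threw part-way cannot leak into a later build.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix: clear saved state in reset() so a build
// that threw part-way cannot leak stale objects into a later build.
class ResettableBuilder {
    private final List<Object> savedBuffer = new ArrayList<>(); // stand-in for savedBuffer
    private Object savedNextKey;                                // stand-in for savedNextKey

    // Called when the builder is recycled; the proposed fix clears both fields.
    void reset() {
        savedBuffer.clear();
        savedNextKey = null;
    }

    List<Object> build(List<?> input, boolean failMidway) {
        reset(); // defensive: discard anything a previous failed build left behind
        savedBuffer.addAll(input);
        savedNextKey = input.isEmpty() ? null : input.get(input.size() - 1);
        if (failMidway)
            throw new RuntimeException("Unknown column");
        List<Object> result = new ArrayList<>(savedBuffer);
        reset();
        return result;
    }
}
```

With the reset in place, a failed build followed by a successful one yields only the second build's elements.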

I will post a discussion thread shortly with more details. 

> ClassCastException thrown on read and write paths after schema modification
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21216
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21216
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Isaac Reath
>            Assignee: Andrés Beck-Ruiz
>            Priority: Normal
>             Fix For: 4.1.x
>
>         Attachments: ClassCastException-Read.txt, 
> ClassCastException-Writes-1.txt
>
>
> After a schema modification on a cluster serving reads and writes, we have 
> noticed that it is possible to see ClassCastException being thrown when 
> comparing clustering keys in a few codepaths (see attached stacktraces). 
> After a schema modification, we see regular ClassCastExceptions happening on 
> both the read and write paths. 
> The table in question is very wide (~4200 columns), and the issue 
> originally occurred on a 30-node cluster running 4.1.3.
> I've been able to reproduce it on a 3 node cluster running 4.1.10 with a 
> single data center that is doing ~60 req/sec and frequent concurrent schema 
> modifications. It appears to be a race as it doesn't happen on every schema 
> change, but will occur at a rate of roughly 1/200 schema changes while the 
> read/write workload is ongoing. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
