[jira] [Commented] (CASSANDRA-19427) Fix concurrent access of ClientWarn causing AIOBE for SELECT WHERE IN queries with multiple coordinator-local partitions

Abe Ratnofsky (Jira) Tue, 27 Feb 2024 13:43:18 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821428#comment-17821428
 ]


Abe Ratnofsky commented on CASSANDRA-19427:
-------------------------------------------

Had some discussion with [~maedhroz] about CopyOnWriteArrayList vs. 
Collections.synchronizedList. There's a reasonable argument that lock 
acquisition from synchronizedList is better than copying the entire list on 
mutation in CopyOnWriteArrayList, especially since CopyOnWriteArrayList is also 
synchronized when mutating the list. But the list should stay small in number 
of elements (there aren't many client warnings that exist in the first place, 
and unlikely that many will be returned on a single response) and each element 
is small (truncated to a fixed maximum length when added).

The main reason I chose CopyOnWriteArrayList is because of synchronizedList's 
semantics for iteration: 
[https://docs.oracle.com/javase/8/docs/api/java/util/Collections.html#synchronizedList-java.util.List-]

> It is imperative that the user manually synchronize on the returned list when 
>iterating over it

I just don't love these semantics - even though we don't iterate over the list 
until we serialize the warnings into the response, it just feels too easy to 
miss that requirement to synchronize on iteration.

> Fix concurrent access of ClientWarn causing AIOBE for SELECT WHERE IN queries 
> with multiple coordinator-local partitions
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19427
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19427
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Legacy/Local Write-Read Paths
>            Reporter: Abe Ratnofsky
>            Assignee: Abe Ratnofsky
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> On one of our clusters, we noticed rare but periodic 
> ArrayIndexOutOfBoundsExceptions:
>  
> {code:java}
> message="Uncaught exception on thread Thread[ReadStage-3,5,main]"
> exception="java.lang.RuntimeException: 
> java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.ArrayIndexOutOfBoundsException"{code}
>  
>  
> The error was in a Runnable, so the stacktrace didn't directly indicate where 
> the error was coming from. We enabled JFR to log the underlying exception 
> that was thrown:
>  
> {code:java}
> message="Uncaught exception on thread Thread[ReadStage-2,5,main]" 
> exception="java.lang.RuntimeException: 
> java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds 
> for length 0
> at java.base/java.util.ArrayList.add(ArrayList.java:487)
> at java.base/java.util.ArrayList.add(ArrayList.java:499)
> at org.apache.cassandra.service.ClientWarn$State.add(ClientWarn.java:84)
> at 
> org.apache.cassandra.service.ClientWarn$State.access$000(ClientWarn.java:77)
> at org.apache.cassandra.service.ClientWarn.warn(ClientWarn.java:51)
> at 
> org.apache.cassandra.db.ReadCommand$1MetricRecording.onClose(ReadCommand.java:596)
> at 
> org.apache.cassandra.db.transform.BasePartitions.runOnClose(BasePartitions.java:70)
> at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:95)
> at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2260)
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2575)
> ... 6 more"{code}
>  
>  
> An AIOBE on ArrayList.add(E) should only be possible when multiple threads 
> attempt to call the method at the same time.
>  
> This was seen while executing a SELECT WHERE IN query with multiple partition 
> keys. This exception could happen when multiple local reads are dispatched by 
> the coordinator in 
> org.apache.cassandra.service.reads.AbstractReadExecutor#makeRequests. In this 
> case, multiple local reads exceed the tombstone warning threshold, so 
> multiple tombstone warnings are added to the same ClientWarn.State reference. 
>  Currently, org.apache.cassandra.service.ClientWarn.State#warnings is an 
> ArrayList, which isn't safe for concurrent modification, causing the AIOBE to 
> be thrown.
>  
> I have a patch available for this, and I'm preparing it now. The patch is 
> simple - it just changes 
> org.apache.cassandra.service.ClientWarn.State#warnings to a thread-safe 
> CopyOnWriteArrayList. I also have a jvm-dtest that demonstrates the issue but 
> doesn't need to be merged - it shows how a SELECT WHERE IN query with local 
> reads that add client warnings can add to the same ClientWarn.State from 
> different threads. I'll push that in a separate branch just for demonstration 
> purposes.
>  
> Demonstration branch: 
> [https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-demo]
> Fix branch: 
> [https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-fix]
>  (PR linked below)
>  
> This appears to have been an issue since at least 3.11, that was the earliest 
> release I checked.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-19427) Fix concurrent access of ClientWarn causing AIOBE for SELECT WHERE IN queries with multiple coordinator-local partitions

Reply via email to