[ 
https://issues.apache.org/jira/browse/CASSANDRA-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089930#comment-18089930
 ] 

Dmitry Konstantinov commented on CASSANDRA-21468:
-------------------------------------------------

The test emulates cancellation of a read command on a replica by delaying its 
execution with an injected delay:
{code:java}
INFO  [node1_isolatedExecutor:1] 2026-06-18T10:16:40,893 SubstituteLogger.java:222 - DEBUG [node1_isolatedExecutor:1] node1 2026-06-18T10:16:40,889 StorageProxy.java:2767 - Query cancelled (timeout)
org.apache.cassandra.exceptions.QueryCancelledException: Query cancelled for taking too long: SELECT * FROM distributed_test_keyspace.tbl WHERE id = 1 AND ck1 = 77 ALLOW FILTERING
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.maybeCancel(ReadCommand.java:885)
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.applyToPartition(ReadCommand.java:860)
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.applyToPartition(ReadCommand.java:837)
	at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:94)
	at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:344)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:262)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:237)
	at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:58)
	at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:446)
	at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2755)
	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:3154)
	at org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:168)
	at org.apache.cassandra.concurrent.SEPExecutor.maybeExecuteImmediately(SEPExecutor.java:216)
	at org.apache.cassandra.concurrent.Stage.maybeExecuteImmediately(Stage.java:130)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.makeRequests(AbstractReadExecutor.java:168)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.makeFullDataRequests(AbstractReadExecutor.java:123)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.executeAsync(AbstractReadExecutor.java:185)
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2676)
	at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:2559)
	at org.apache.cassandra.service.StorageProxy.dispatchReadWithRetryOnDifferentSystem(StorageProxy.java:2462)
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:2191)
	at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1437)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:525)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:430)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:160)
	at org.apache.cassandra.distributed.impl.CoordinatorHelper.unsafeExecuteInternal(CoordinatorHelper.java:70)
	at org.apache.cassandra.distributed.impl.CoordinatorHelper.unsafeExecuteInternal(CoordinatorHelper.java:48)
	at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:127)
	at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:64)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
	at org.apache.cassandra.concurrent.CassandraThread.run(CassandraThread.java:82)
 {code}

Based on this, the replica answers the coordinator with an error. 
The coordinator logic waits for an answer from a replica (1 replica in the 
current test).
If the local replica read is executed in a different thread, the coordinator 
does not wait long enough to get the timeout answer from the replica: it 
classifies the result as a timeout due to the lack of any responses and 
returns ReadTimeoutException. The RequestFailure.UNKNOWN answer from the local 
replica executed in another thread arrives later and is not taken into account 
by the coordinator logic.
If the local replica read is executed in the same thread, the coordinator 
logic runs only after it completes, so the coordinator sees the local replica 
response. StorageProxy.LocalReadRunnable#runMayThrow classifies 
QueryCancelledException generically as RequestFailure.UNKNOWN. As a result, 
the coordinator classifies the overall result as ReadFailureException (no 
timeout while the coordinator was awaiting + RequestFailure.UNKNOWN).
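
The two orderings described above can be illustrated with a small standalone 
sketch. It does not use Cassandra code: the class CoordinatorRaceSketch, the 
Outcome enum, the coordinate method and the "UNKNOWN" string are hypothetical 
stand-ins, and the delays are arbitrary values chosen only so that the replica 
answers later than the coordinator is willing to wait.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified model of the race, not Cassandra code. All names are hypothetical.
public final class CoordinatorRaceSketch {
    public enum Outcome { READ_TIMEOUT, READ_FAILURE }

    public static Outcome coordinate(boolean sameThread, long replicaDelayMs, long awaitMs) {
        CompletableFuture<String> replicaResponse = new CompletableFuture<>();

        // The "local read": delayed, then answered with a generic failure,
        // standing in for the cancelled query reported as UNKNOWN.
        Runnable localRead = () -> {
            try {
                Thread.sleep(replicaDelayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            replicaResponse.complete("UNKNOWN");
        };

        ExecutorService readStage = Executors.newSingleThreadExecutor();
        try {
            if (sameThread)
                localRead.run();              // coordinator thread runs the read itself
            else
                readStage.execute(localRead); // read runs concurrently on another thread

            try {
                // Coordinator awaits the replica response for a bounded time.
                replicaResponse.get(awaitMs, TimeUnit.MILLISECONDS);
                return Outcome.READ_FAILURE;  // a failure response was seen
            } catch (TimeoutException e) {
                return Outcome.READ_TIMEOUT;  // no response arrived in time
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            readStage.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(coordinate(false, 300, 50)); // READ_TIMEOUT
        System.out.println(coordinate(true, 300, 50));  // READ_FAILURE
    }
}
```

In this model the same delayed failure yields a different coordinator outcome 
depending only on which thread runs the local read.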

The issue itself was not introduced by CASSANDRA-21429, but the change in 
CASSANDRA-21429 increased the chances of using the same thread to read and 
coordinate, so the test started to fail.

One possible way to fix the issue is to classify the request failure more 
accurately, as RequestFailure.TIMEOUT instead of RequestFailure.UNKNOWN, when 
a QueryCancelledException is thrown on a replica.
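
A minimal sketch of that classification idea, assuming nothing about the 
actual patch: FailureClassificationSketch, FailureReason, QueryCancelled and 
classify are hypothetical names that only mirror RequestFailure and 
QueryCancelledException for illustration, and are not the Cassandra types.

```java
// Illustration only: hypothetical stand-ins, not the Cassandra classes.
public final class FailureClassificationSketch {
    public enum FailureReason { UNKNOWN, TIMEOUT }

    // Stand-in for a "query cancelled for taking too long" exception.
    public static final class QueryCancelled extends RuntimeException {
        public QueryCancelled(String message) {
            super(message);
        }
    }

    // A cancelled query is reported as a timeout; anything else stays generic.
    public static FailureReason classify(Throwable error) {
        if (error instanceof QueryCancelled)
            return FailureReason.TIMEOUT;
        return FailureReason.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(new QueryCancelled("Query cancelled for taking too long")));
        System.out.println(classify(new IllegalStateException("some other failure")));
    }
}
```

With such a mapping, the reason reported for a cancelled read would no longer 
depend on the generic fallback.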

> Test failure: 
> org.apache.cassandra.distributed.test.TimeoutAbortTest.timeoutTest 
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21468
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21468
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: CI, Local/Other
>            Reporter: Sam Tunnicliffe
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 6.x, 7.x
>
>         Attachments: image-2026-06-17-11-51-32-807.png, 
> image-2026-06-17-11-56-27-080.png
>
>
> Observed in 6.0 & trunk runs since: 
> [46|https://ci-cassandra.apache.org/job/Cassandra-6.0/46/testReport/junit/org.apache.cassandra.distributed.test/TimeoutAbortTest/],
> [2508|https://ci-cassandra.apache.org/job/Cassandra-trunk/2508/testReport/org.apache.cassandra.distributed.test/TimeoutAbortTest]
> {{git bisect}} claims the regression was introduced by
>  {code}
> commit 88aa5b6807dbd97446d34864e87a34493880358b (HEAD)
> Author: Dmitry Konstantinov <[email protected]>
> Date:   Sat Jun 6 18:56:14 2026 +0100
>     SEPExecutor.maybeExecuteImmediately does not always execute tasks immediately despite available worker capacity
>     Additional improvement: use a wait-free logic to return a task or work permit
>     patch by Dmitry Konstantinov; reviewed by Benedict Elliott Smith for CASSANDRA-21429
> {code}
> cc [~dnk]


