[
https://issues.apache.org/jira/browse/CASSANDRA-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089930#comment-18089930
]
Dmitry Konstantinov commented on CASSANDRA-21468:
-------------------------------------------------
The test emulates cancellation of a read command on a replica by delaying its
execution with an injected delay:
{code:java}
INFO [node1_isolatedExecutor:1] 2026-06-18T10:16:40,893 SubstituteLogger.java:222 - DEBUG [node1_isolatedExecutor:1] node1 2026-06-18T10:16:40,889 StorageProxy.java:2767 - Query cancelled (timeout)
org.apache.cassandra.exceptions.QueryCancelledException: Query cancelled for taking too long: SELECT * FROM distributed_test_keyspace.tbl WHERE id = 1 AND ck1 = 77 ALLOW FILTERING
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.maybeCancel(ReadCommand.java:885)
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.applyToPartition(ReadCommand.java:860)
	at org.apache.cassandra.db.ReadCommand$QueryCancellationChecker.applyToPartition(ReadCommand.java:837)
	at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:94)
	at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:344)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:262)
	at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:237)
	at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:58)
	at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:446)
	at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2755)
	at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:3154)
	at org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:168)
	at org.apache.cassandra.concurrent.SEPExecutor.maybeExecuteImmediately(SEPExecutor.java:216)
	at org.apache.cassandra.concurrent.Stage.maybeExecuteImmediately(Stage.java:130)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.makeRequests(AbstractReadExecutor.java:168)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.makeFullDataRequests(AbstractReadExecutor.java:123)
	at org.apache.cassandra.service.reads.AbstractReadExecutor.executeAsync(AbstractReadExecutor.java:185)
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2676)
	at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:2559)
	at org.apache.cassandra.service.StorageProxy.dispatchReadWithRetryOnDifferentSystem(StorageProxy.java:2462)
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:2191)
	at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1437)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:525)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:430)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:160)
	at org.apache.cassandra.distributed.impl.CoordinatorHelper.unsafeExecuteInternal(CoordinatorHelper.java:70)
	at org.apache.cassandra.distributed.impl.CoordinatorHelper.unsafeExecuteInternal(CoordinatorHelper.java:48)
	at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:127)
	at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:64)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
	at org.apache.cassandra.concurrent.CassandraThread.run(CassandraThread.java:82)
{code}
As a result of the cancellation, the replica answers the coordinator with an
error.
The coordinator logic waits for an answer from a replica (there is only 1
replica in the current test).
If the local replica read is executed in a different thread, the coordinator
does not wait long enough to receive the timeout answer from the replica. It
classifies the result as a timeout due to the lack of any responses and returns
ReadTimeoutException. The RequestFailure.UNKNOWN answer from the local replica
executed in the other thread arrives later and is not taken into account by the
coordinator logic.
If the local replica read is executed in the same thread, the coordinator logic
runs only after it completes, so the coordinator sees the local replica
response. The StorageProxy.LocalReadRunnable#runMayThrow logic classifies
QueryCancelledException generically as RequestFailure.UNKNOWN. As a result, the
coordinator classifies the overall result as ReadFailureException (no timeout
while the coordinator was waiting + RequestFailure.UNKNOWN).
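To make the two cases above easier to compare, here is a simplified model of
them. It is only a sketch and not the actual Cassandra code: the class
CoordinatorOutcomeSketch, the Outcome enum and the 10 ms wait are assumptions
made for illustration, and a CompletableFuture stands in for the replica
response handling.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the coordinator wait; none of these names exist in Cassandra.
final class CoordinatorOutcomeSketch {
    enum Outcome { SUCCESS, READ_TIMEOUT, READ_FAILURE }

    // Waits up to waitMillis for the replica reply and classifies what was observed.
    static Outcome await(CompletableFuture<String> replicaReply, long waitMillis) {
        try {
            replicaReply.get(waitMillis, TimeUnit.MILLISECONDS);
            return Outcome.SUCCESS;
        } catch (TimeoutException e) {
            // No response at all within the wait: reported as a timeout.
            return Outcome.READ_TIMEOUT;
        } catch (ExecutionException e) {
            // A failure response was seen before the wait expired.
            return Outcome.READ_FAILURE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Same thread: the local read has already failed when the coordinator starts waiting.
    static Outcome sameThread() {
        CompletableFuture<String> reply = new CompletableFuture<>();
        reply.completeExceptionally(new RuntimeException("UNKNOWN"));
        return await(reply, 10);
    }

    // Different thread: the failure arrives only after the coordinator stopped waiting.
    static Outcome differentThread() {
        CompletableFuture<String> reply = new CompletableFuture<>();
        Outcome outcome = await(reply, 10);
        reply.completeExceptionally(new RuntimeException("UNKNOWN")); // too late, ignored
        return outcome;
    }

    public static void main(String[] args) {
        System.out.println("same thread:      " + sameThread());
        System.out.println("different thread: " + differentThread());
    }
}
```

In this model the same cancelled read is reported as READ_FAILURE in the
same-thread case and as READ_TIMEOUT in the different-thread case, which
mirrors the difference the test became sensitive to.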
The issue itself was not introduced by CASSANDRA-21429, but the change in
CASSANDRA-21429 increased the chances of using the same thread to read and
coordinate, so the test has started to fail.
One possible way to fix the issue is to classify the request failure more
accurately, as RequestFailure.TIMEOUT instead of RequestFailure.UNKNOWN, when
we get QueryCancelledException on a replica.
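A minimal sketch of that idea, using stand-in types rather than the real
Cassandra classes: FailureClassificationSketch, FailureReason and
QueryCancelled are hypothetical names introduced only for this illustration,
and the actual change in StorageProxy.LocalReadRunnable#runMayThrow may look
different.

```java
// Hypothetical stand-ins for RequestFailure and QueryCancelledException.
final class FailureClassificationSketch {
    enum FailureReason { UNKNOWN, TIMEOUT }

    // Stand-in for the exception thrown when a read is cancelled for taking too long.
    static final class QueryCancelled extends RuntimeException {
        QueryCancelled(String message) { super(message); }
    }

    // Behaviour described in this comment: every exception is reported as UNKNOWN.
    static FailureReason classifyGenerically(Throwable t) {
        return FailureReason.UNKNOWN;
    }

    // Proposed behaviour: a cancelled query is reported as TIMEOUT, anything else as UNKNOWN.
    static FailureReason classifyWithTimeout(Throwable t) {
        return t instanceof QueryCancelled ? FailureReason.TIMEOUT : FailureReason.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable cancelled = new QueryCancelled("Query cancelled for taking too long");
        System.out.println(classifyGenerically(cancelled));                        // UNKNOWN
        System.out.println(classifyWithTimeout(cancelled));                        // TIMEOUT
        System.out.println(classifyWithTimeout(new IllegalStateException("x")));   // UNKNOWN
    }
}
```

With such a classification the coordinator would have enough information to
treat a cancelled local read as a timeout regardless of which thread ran it.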
> Test failure:
> org.apache.cassandra.distributed.test.TimeoutAbortTest.timeoutTest
> ---------------------------------------------------------------------------------
>
> Key: CASSANDRA-21468
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21468
> Project: Apache Cassandra
> Issue Type: Bug
> Components: CI, Local/Other
> Reporter: Sam Tunnicliffe
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 6.x, 7.x
>
> Attachments: image-2026-06-17-11-51-32-807.png,
> image-2026-06-17-11-56-27-080.png
>
>
> Observed in 6.0 & trunk runs since: [46|
> https://ci-cassandra.apache.org/job/Cassandra-6.0/46/testReport/junit/org.apache.cassandra.distributed.test/TimeoutAbortTest/],
>
> [2508|https://ci-cassandra.apache.org/job/Cassandra-trunk/2508/testReport/org.apache.cassandra.distributed.test/TimeoutAbortTest]
> {{git bisect}} claims the regression was introduced by
> {code}
> commit 88aa5b6807dbd97446d34864e87a34493880358b (HEAD)
> Author: Dmitry Konstantinov <[email protected]>
> Date: Sat Jun 6 18:56:14 2026 +0100
> SEPExecutor.maybeExecuteImmediately does not always execute tasks
> immediately despite available worker capacity
> Additional improvement: use a wait-free logic to return a task or work
> permit
> patch by Dmitry Konstantinov; reviewed by Benedict Elliott Smith for
> CASSANDRA-21429
> {code}
> cc [~dnk]