[
https://issues.apache.org/jira/browse/CASSANDRA-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233060#comment-17233060
]
Adam Holmberg edited comment on CASSANDRA-16097 at 11/16/20, 10:08 PM:
-----------------------------------------------------------------------
The basic symptom is a read request that has "finished" with no data and no
failure. The read executor then tries to get the data, and we trip on the
assertion. We arrive there under the following conditions:
N=2, RF=2, read at ONE
The read fails on the local node due to the tombstone read threshold.
There is a
[race|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/StorageProxy.java#L1803-L1814]
between the async execution of the read on the local replica and the decision
to send a speculative execution (spec exec). If the local failure takes long
enough, a spec exec is triggered, and the
[contacts list is
updated|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/StorageProxy.java#L1803-L1814].
Meanwhile, the local request fails and the [callback is
signaled|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L170-L171].
When we
[awaitResults|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L101-L103],
we find a signaled callback, but {{blockfor(1) + failures(1)}} is not greater
than the number of contacts, which the spec exec has raised to 2. The failure
check therefore does not fire, and we return success with a resolver that has
no data.
The proposed patch makes this logic depend on the responses actually received
and on the presence of data:
https://github.com/aholmberg/cassandra/pull/17
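
To make the arithmetic concrete, here is a minimal, self-contained sketch of
the two checks. It is a simplified model under stated assumptions, not the
Cassandra source and not the patch itself: the class name
ReadCompletionSketch and both method names are hypothetical, and the second
check is only one plausible reading of "depend on the actual responses, and
presence of data".

```java
// Simplified model of the coordinator's "did this read fail?" decision.
// All names here are hypothetical; this is not the Cassandra implementation.
public final class ReadCompletionSketch {

    // Check as described in the comment: a failure is reported only when
    // blockFor + failures exceeds the number of contacted replicas.
    public static boolean failedByContacts(int blockFor, int failures, int contacts) {
        return failures > 0 && blockFor + failures > contacts;
    }

    // Assumed shape of a check based on what was actually received:
    // fail if too few responses arrived or no data response is present.
    public static boolean failedByResponses(int blockFor, int failures,
                                            int responses, boolean dataPresent) {
        return failures > 0 && (blockFor > responses || !dataPresent);
    }

    public static void main(String[] args) {
        int blockFor = 1; // read at ONE
        int failures = 1; // local replica failed on the tombstone threshold

        // No spec exec: contacts stays at 1, so 1 + 1 > 1 and the failure is reported.
        System.out.println(failedByContacts(blockFor, failures, 1));

        // Spec exec won the race: contacts is now 2, 1 + 1 > 2 is false,
        // so the read is treated as a success even though no data arrived.
        System.out.println(failedByContacts(blockFor, failures, 2));

        // Response-based check: zero responses and no data is a failure,
        // regardless of how many replicas were contacted.
        System.out.println(failedByResponses(blockFor, failures, 0, false));
    }
}
```

The point of the sketch is that the contacts-based check changes its answer
when the contacts list grows mid-flight, while a check keyed on received
responses does not.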
I also added a couple of assertions because we were violating some assumptions
along the way before tripping on the one described in this ticket.
[ci|https://app.circleci.com/pipelines/github/aholmberg/cassandra?branch=CASSANDRA-16097]
> DigestResolver.getData throws AssertionError since dataResponse is null
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-16097
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16097
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination
> Reporter: David Capwell
> Assignee: Adam Holmberg
> Priority: Normal
> Fix For: 4.0-beta
>
>
> Was running a benchmark at LOCAL_ONE and eventually saw the below exception
> {code}
> 2020-09-02 21:08:59,872 ERROR [Native-Transport-Requests-35] org.apache.cassandra.transport.Message - Unexpected exception during request; channel = [id: 0x13bb89d4, L:/10.14.92.74:9042 - R:/10.14.89.248:47112]
> java.lang.AssertionError
> at org.apache.cassandra.service.reads.DigestResolver.getData(DigestResolver.java:77) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:390) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1821) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1711) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1628) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1097) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:294) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:246) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:88) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:216) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:498) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:476) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:138) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.transport.Message$Request.execute(Message.java:253) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.transport.Message$Dispatcher.processRequest(Message.java:725) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.transport.Message$Dispatcher.lambda$channelRead0$0(Message.java:630) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119) [apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.50.Final.jar:4.1.50.Final]
> at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
> {code}
> This exception was not frequent; over the whole run (3h) I only saw it
> twice.