[ 
https://issues.apache.org/jira/browse/CASSANDRA-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233060#comment-17233060
 ] 

Adam Holmberg edited comment on CASSANDRA-16097 at 11/16/20, 10:08 PM:
-----------------------------------------------------------------------

The basic symptom is a read request that has "finished" with no data and no 
failure. The read executor then tries to get the data, and we trip on the 
assertion. We arrive there under the following conditions:

N=2, RF=2, read ONE
The read will fail on the local node due to the tombstone read threshold.

There is a 
[race|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/StorageProxy.java#L1803-L1814]
 between the async execution of the read on the local replica and the decision 
to send a speculative execution (spec exec). If the local failure takes long 
enough, a spec exec is triggered and the 
[contacts list is updated|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/StorageProxy.java#L1803-L1814].
 Meanwhile, the local request fails and the 
[callback is signaled|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L170-L171].
 When we reach 
[awaitResults|https://github.com/apache/cassandra/blob/45acc6318ac063eb9553857d0ec0df550f94e627/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L101-L103],
 we find a signaled callback, but {{blockfor(1) + failures(1)}} is not greater 
than the number of contacts as updated by the spec exec, so the request is not 
treated as failed. We thus return success with a resolver that has no data.
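
To make the arithmetic concrete, here is a minimal sketch of the comparison 
described above. The class and method names are hypothetical and this is not 
the code in ReadCallback; it only reproduces the {{blockfor + failures}} versus 
contacts comparison with the numbers from this scenario.

```java
/** Hypothetical sketch of the contacts-based failure check; names are illustrative only. */
final class ContactsBasedCheckSketch {

    /** The request counts as failed only if blockFor + failures exceeds the contact count. */
    static boolean failed(int blockFor, int failures, int contacts) {
        return blockFor + failures > contacts;
    }

    public static void main(String[] args) {
        // Local read fails before any spec exec: contacts is still 1, so 1 + 1 > 1.
        System.out.println("no spec exec:   failed=" + failed(1, 1, 1));
        // Spec exec already added the second replica: contacts is 2, so 1 + 1 > 2 is false.
        System.out.println("with spec exec: failed=" + failed(1, 1, 2));
    }
}
```

With the contacts list grown to 2, the same single failure no longer counts as 
a failed request, which is how a signaled callback ends up reported as a 
success.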

The proposed patch makes this check depend on the actual responses received 
and on the presence of data:
https://github.com/aholmberg/cassandra/pull/17
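
As a rough illustration of what a response-based check could look like, the 
sketch below decides failure from the number of responses received and whether 
a data response is present. This is an assumption for illustration, not 
necessarily what the linked pull request does; the class, method, and parameter 
names are all hypothetical.

```java
/** Hypothetical sketch of a response-based failure check; not the code from the patch. */
final class ResponseBasedCheckSketch {

    /**
     * A request with at least one failure counts as failed when fewer responses
     * than blockFor have arrived, or when no data response is present.
     */
    static boolean failed(int blockFor, int failures, int responsesReceived, boolean dataPresent) {
        return failures > 0 && (blockFor > responsesReceived || !dataPresent);
    }

    public static void main(String[] args) {
        // Scenario from this ticket: local read failed, spec exec has not answered yet.
        System.out.println("no responses yet: failed=" + failed(1, 1, 0, false));
        // The spec exec replica answered with data before we checked.
        System.out.println("data arrived:     failed=" + failed(1, 1, 1, true));
    }
}
```

Under this kind of check, the size of the contacts list no longer matters, so 
a spec exec racing with the local failure cannot turn it into a success with 
an empty resolver.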

I also added a couple of assertions because we were violating some assumptions 
along the way before tripping on the one described in this ticket.

[ci|https://app.circleci.com/pipelines/github/aholmberg/cassandra?branch=CASSANDRA-16097]


> DigestResolver.getData throws AssertionError since dataResponse is null
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-16097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: David Capwell
>            Assignee: Adam Holmberg
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> I was running a benchmark at LOCAL_ONE and eventually saw the exception below:
> {code}
> 2020-09-02 21:08:59,872 ERROR [Native-Transport-Requests-35] org.apache.cassandra.transport.Message - Unexpected exception during request; channel = [id: 0x13bb89d4, L:/10.14.92.74:9042 - R:/10.14.89.248:47112]
> java.lang.AssertionError
>        at org.apache.cassandra.service.reads.DigestResolver.getData(DigestResolver.java:77) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:390) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1821) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1711) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1628) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1097) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:294) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:246) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:88) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:216) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:498) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:476) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:138) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.transport.Message$Request.execute(Message.java:253) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.transport.Message$Dispatcher.processRequest(Message.java:725) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.transport.Message$Dispatcher.lambda$channelRead0$0(Message.java:630) ~[apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119) [apache-cassandra-4.0.0-beta3.jar:4.0.0-beta3]
>        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.50.Final.jar:4.1.50.Final]
>        at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
> {code}
> This exception was infrequent: over the whole run (3h) it appeared only 
> twice.


