[ 
https://issues.apache.org/jira/browse/CASSANDRA-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16807:
----------------------------------------
    Test and Documentation Plan: a new time-bound multi-threaded unit test that 
"fuzzes" the problematic behavior described in this issue
                         Status: Patch Available  (was: In Progress)

I'm pushing up a trunk-based version of this fix to get review started, but a 
4.0.x patch would be identical.

[trunk|https://github.com/apache/cassandra/pull/1110]
[j8|https://app.circleci.com/pipelines/github/maedhroz/cassandra/290/workflows/d0ce3435-7247-49fc-a61e-8ca29db7d64e]
[j11|https://app.circleci.com/pipelines/github/maedhroz/cassandra/290/workflows/71e4bfbc-96f9-47d4-8de8-95ae694762ff]

Note that in this run of the tests, I didn't have a time-bound on the new test, 
so the fact that they time out is actually a good thing. (i.e. They don't hit 
the assertion.)

> Weak visibility guarantees of Accumulator lead to failed assertions during 
> digest comparison
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16807
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16807
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.0.x, 4.x
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This problem could manifest on all versions, beginning on at least 3.0, but 
> I’ll focus on the way it manifests in 4.0 here.
> In what now seems like a wise move, CASSANDRA-16097 added an assertion to 
> {{DigestResolver#responseMatch()}} that ensures the responses snapshot has at 
> least one visible elements to compare (although of course only one element 
> trivially cannot generate a mismatch and short-circuits immediately). 
> However, at the point {{ReadCallback#onResponse()}} signals the waiting 
> resolver, there is no guarantee that the size of the generated snapshot of 
> the responses {{Accumulator}} is non-zero, or perhaps more worryingly, at 
> least equal to the number of blocked-for responses. This seems to be a 
> consequence of the documented weak visibility guarantees on 
> {{Accumulator#add()}}. In short, if there are concurrent invocations of 
> add(), is it not guaranteed that there is any visible size change after any 
> one of them return, but only after all complete.
> The particular exception looks something like this:
> {noformat}
> java.lang.AssertionError: Attempted response match comparison while no 
> responses have been received.
>       at 
> org.apache.cassandra.service.reads.DigestResolver.responsesMatch(DigestResolver.java:110)
>       at 
> org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:393)
>       at 
> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2150)
>       at 
> org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1979)
>       at 
> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1882)
>       at 
> org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1121)
>       at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:296)
>       at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:248)
>       at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:90)
> {noformat}
> It’s possible to reproduce this on simple single-partition reads without any 
> short-read protection or replica filtering protection. I’ve also been able to 
> reproduce this synthetically with [a unit 
> test|https://github.com/apache/cassandra/pull/1110] on {{ReadCallback}}.
> It seems like the most straightforward way to fix this would be to avoid 
> signaling in {{ReadCallback#onResponse()}} until the visible size of the 
> accumulator is at least the number of received responses. In most cases, this 
> is trivially true, and our signaling behavior won’t change at all. In the 
> very rare case that there are two (or more) concurrent calls to 
> {{onResponse()}}, the second (or last) will signal, and having one more 
> response than we strictly need should have no negative side effects. (We 
> don’t seem to make any strict assertions about having exactly the number of 
> required responses, only that we have enough.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to