[ 
https://issues.apache.org/jira/browse/CASSANDRA-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280846#comment-17280846
 ] 

Alex Petrov commented on CASSANDRA-16307:
-----------------------------------------

The problem here turned out to be that the Group By pager incorrectly interacts 
with SRP. 

In short, we have multiple implementations of paging counters: one is regular 
CQL one and, the other is Group By. CQL one is counting rows as it sees them. 
Group By is counting rows its going to return (in other words, groups) only 
when it encounters the next group (or partition/data end).

The problem is that when issuing SRP, we rely on correctness of these counters. 
And if counter says that iterator couldn’t give enough data to exhaust limits, 
we’re not going to ask for more data. In case of regular counters it all works 
as expected, but with group by counters, we’ll say that we have seen 1 row less 
every time, and won’t issue SRP. In other words, if one of the nodes holds a 
row followed by the tombstone, current behaviour will prevent SRP from getting 
triggered and tombstone from being visible, since group will be unfinished, and 
it doesn’t make sense to ask the node for more data since it couldn’t even 
satisfy the current data limit.

A straightforward solution to this would have been to just start counting 
groups as we’re encountering them. However, here the problem is that if we say 
that we have a group as soon as we see the first row that belongs to it, 
internal pager will think that there’s no need to continue, since data limits 
are satisfied.

The patch is introducing a distinction between needsMoreContents (better name 
pending) and isDone. This allows us to know we need to fetch more rows in case 
we haven’t seen either end of data, or next row. I think we can even start 
counting groups right away instead of delaying counting till we see the next 
group to (potentially) simplify the logic, however I wanted to leave a larger 
refactoring to a separate patch.

|[trunk|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:16307-trunk]|[CI|https://app.circleci.com/pipelines/github/ifesdjeen/cassandra?branch=16307-trunk]|
|[3.11|https://github.com/apache/cassandra/compare/cassandra-3.11...ifesdjeen:16307-3.11]|[CI|https://app.circleci.com/pipelines/github/ifesdjeen/cassandra?branch=16307-3.11]|

> GROUP BY queries with paging can return deleted data
> ----------------------------------------------------
>
>                 Key: CASSANDRA-16307
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16307
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Andres de la Peña
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 3.11.x, 4.0-beta
>
>
> {{GROUP BY}} queries using paging and CL>ONE/LOCAL_ONE. This dtest reproduces 
> the problem:
> {code:java}
> try (Cluster cluster = init(Cluster.create(2)))
> {
>     cluster.schemaChange(withKeyspace("CREATE TABLE %s.t (pk int, ck int, 
> PRIMARY KEY (pk, ck))"));
>     ICoordinator coordinator = cluster.coordinator(1);
>     coordinator.execute(withKeyspace("INSERT INTO %s.t (pk, ck) VALUES (0, 
> 0)"), ConsistencyLevel.ALL);
>     coordinator.execute(withKeyspace("INSERT INTO %s.t (pk, ck) VALUES (1, 
> 1)"), ConsistencyLevel.ALL);
>     
>     cluster.get(1).executeInternal(withKeyspace("DELETE FROM %s.t WHERE pk=0 
> AND ck=0"));
>     cluster.get(2).executeInternal(withKeyspace("DELETE FROM %s.t WHERE pk=1 
> AND ck=1"));
>     String query = withKeyspace("SELECT * FROM %s.t GROUP BY pk");
>     Iterator<Object[]> rows = coordinator.executeWithPaging(query, 
> ConsistencyLevel.ALL, 1);
>     assertRows(Iterators.toArray(rows, Object[].class));
> }
> {code}
> Using a 2-node cluster and RF=2, the test inserts two partitions in both 
> nodes. Then it locally deletes each row in a separate node, so each node sees 
> a different partition alive, but reconciliation should produce no alive 
> partitions. However, a {{GROUP BY}} query using a page size of 1 wrongly 
> returns one of the rows.
> This has been detected during CASSANDRA-16180, and it is probably related to 
> CASSANDRA-15459, which solved a similar problem for group-by queries with 
> limit, instead of paging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to