[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108647#comment-15108647
 ] 

Benjamin Lerer commented on CASSANDRA-10707:
--------------------------------------------

||patch||utests||dtests||
|[trunk|https://github.com/apache/cassandra/compare/trunk...blerer:10707-trunk]|[trunk|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-10707-trunk-testall/]|[trunk|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-10707-trunk-dtest/]|

The DTest branch is 
[here|https://github.com/riptano/cassandra-dtest/pull/753/files]

The classes {{GroupBySpecification}} and {{GroupSelector}} are used to create 
{{GroupMaker}} instances. A {{GroupMaker}} is used, on a sorted set of rows, to 
determine if a row belongs to the same group as the previous row or not.
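
To illustrate the idea, here is a minimal sketch. It is not the code from the patch: the class name {{PrefixGroupMaker}}, the use of {{Object[]}} for primary keys, and the {{countGroups}} helper are assumptions made for the example. It only shows how, on sorted rows, comparing the grouped key prefix with that of the previous row is enough to detect a group boundary.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a GroupMaker; not the class from the patch.
final class PrefixGroupMaker {
    private final int prefixSize;   // number of leading primary key components used for grouping
    private Object[] lastPrefix;    // prefix of the previous row, null before the first row

    PrefixGroupMaker(int prefixSize) {
        this.prefixSize = prefixSize;
    }

    /** Returns true if the row does not belong to the same group as the previous row. */
    boolean isNewGroup(Object[] primaryKey) {
        Object[] prefix = Arrays.copyOf(primaryKey, prefixSize);
        boolean isNew = lastPrefix == null || !Arrays.equals(lastPrefix, prefix);
        lastPrefix = prefix;
        return isNew;
    }

    /** Counts the groups in rows that are already sorted by primary key. */
    static int countGroups(List<Object[]> sortedRows, int prefixSize) {
        PrefixGroupMaker maker = new PrefixGroupMaker(prefixSize);
        int groups = 0;
        for (Object[] key : sortedRows) {
            if (maker.isNewGroup(key)) {
                groups++;
            }
        }
        return groups;
    }
}
```

Because the rows are sorted, the sketch only needs to remember the previous prefix and never has to keep a set of all the groups seen so far.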

For the moment, only one type of {{GroupSelector}} exists: the one for primary 
key columns. Its serialization mechanism has, nevertheless, been implemented in 
such a way that it will be possible to add new implementations (to allow the 
use of functions in the {{GROUP BY}} clause, for example) without breaking 
backward compatibility.

{{SelectStatement}} and {{Selection}} have been modified in order to use 
{{GroupBySpecification}} and {{GroupMaker}} when building the result set.

Group by queries are always paged internally to avoid {{OOMExceptions}}. Two 
new {{DataLimits}} have been added to manage the group by paging: 
{{CQLGroupByLimits}} and {{CQLGroupByPagingLimits}}. They keep track of the 
group count and of the row count to make sure that the processing is stopped as 
soon as one of the limits is reached.

A group is only counted once the next one is reached, as a group can be spread 
over multiple pages. The problem with this approach is that a counter can only 
know that it has reached the group limit once it sees a row that should not 
be added to the result set. As multiple counters are used when a request is 
processed, the extra row is not filtered out until it reaches the counter of the 
{{QueryPager}}. To do that, a special factory method has been added to 
{{DataLimits}}: {{forPagingByQueryPager(int pageSize)}}.
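
The counting rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the counters from the patch: the class {{GroupLimitCounter}} and its methods are hypothetical, it tracks only the group limit (not the row limit), and it assumes a group limit of at least 1.

```java
// Hypothetical sketch of "a group is only counted once the next one is reached".
final class GroupLimitCounter {
    private final int groupLimit;   // assumed to be >= 1
    private int groupsCounted;      // groups known to be complete
    private boolean inGroup;        // true once at least one row has been seen
    private boolean done;           // true once the group limit has been reached

    GroupLimitCounter(int groupLimit) {
        this.groupLimit = groupLimit;
    }

    /**
     * Returns true if the row must be kept, false if it is the extra row
     * (or any later row) that should not be added to the result set.
     */
    boolean accept(boolean startsNewGroup) {
        if (done) {
            return false;
        }
        if (startsNewGroup && inGroup) {
            // Only now do we know that the previous group is complete.
            groupsCounted++;
            if (groupsCounted >= groupLimit) {
                done = true;
                return false;
            }
        }
        inGroup = true;
        return true;
    }

    int groupsCounted() {
        return groupsCounted;
    }

    boolean isDone() {
        return done;
    }
}
```

With a limit of 2, the counter only reports that it is done when it is handed the first row of the third group, which is exactly the extra row that has to be filtered out. If the data ends before the limit is reached, the last group is never followed by another one; the sketch does not model that end-of-data case.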

This approach did not work properly in the case of the 
{{MultiPartitionPager}}, as an extra counter was added on top of the one of the 
{{SinglePartitionPager}}. To solve that problem, the use of the counter in 
{{MultiPartitionPager}} has been replaced by another mechanism.

The internal paging is performed by the {{GroupByQueryPager}}, which 
automatically fetches new pages of data when needed. 

As the {{DataLimits}} need to be updated for each new internal query and the 
{{ReadQuery}} classes are immutable, a new {{withUpdatedLimit}} method has been 
added to all the {{ReadQuery}} classes.
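
The pattern is the usual one for immutable objects: return a modified copy instead of changing the instance. A minimal sketch, assuming a hypothetical {{ReadQuerySketch}} class with an {{int}} limit (the real method works on the {{ReadQuery}} classes and their {{DataLimits}}, and its exact signature is not shown here):

```java
// Hypothetical immutable query; only illustrates the copy-with-new-limit pattern.
final class ReadQuerySketch {
    private final String table;
    private final int limit;

    ReadQuerySketch(String table, int limit) {
        this.table = table;
        this.limit = limit;
    }

    /** Returns a copy of this query with the new limit; this instance is left unchanged. */
    ReadQuerySketch withUpdatedLimit(int newLimit) {
        return new ReadQuerySketch(table, newLimit);
    }

    String table() {
        return table;
    }

    int limit() {
        return limit;
    }
}
```

A pager built this way can derive the query for each internal page from the original one without affecting any other holder of the original query.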

In order to simplify the {{SelectStatement}} code, the patch also slightly 
modifies the way queries with aggregates but no {{GROUP BY}} work.
I initially implemented them on top of the group by paging but realized 
afterward that it was breaking backward compatibility. We will nevertheless be 
able to switch back to that approach in the future, once we are sure that the 
group by paging is supported by the previous versions.


> Add support for Group By to Select statement
> --------------------------------------------
>
>                 Key: CASSANDRA-10707
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>
> Now that Cassandra supports aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
