[ 
https://issues.apache.org/jira/browse/PHOENIX-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155385#comment-15155385
 ] 

Julian Hyde commented on PHOENIX-2700:
--------------------------------------

Do you want to detect whether there are any duplicates? Or find the keys that 
have duplicates? It seems strange that you'd want to sum the counts.

One optimization strategy would be to use sliding windows. If the value is the 
same as the previous one, you have a duplicate. This avoids the cost of computing 
precise counts, which requires a hash table in which most keys have only a 
single value. But I don't think Phoenix has sliding windows yet.
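The sliding-window idea can be sketched outside of SQL; this is a hypothetical illustration, not Phoenix code. Over values that are already in sort order, any duplicate must be adjacent to its twin, so a single comparison per row replaces the hash table of per-key counts, and the scan can stop at the first hit:

```python
def has_duplicate(sorted_values):
    """Sliding-window duplicate check over values in sort order.

    Any duplicate must be adjacent to an equal value, so one
    comparison per row suffices -- no per-key count table.
    """
    prev = object()  # sentinel that compares unequal to everything
    for v in sorted_values:
        if v == prev:
            return True  # short-circuit on the first duplicate
        prev = v
    return False
```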

> Push down count(group by key) queries
> -------------------------------------
>
>                 Key: PHOENIX-2700
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2700
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>
> Queries that attempt to detect duplicates potentially return a lot of data to 
> the client if the column being deduped is near-unique. For example:
> {code}
> SELECT SUM(DUP_COUNT) 
> FROM ( 
>     SELECT DEDUP_KEY, COUNT(1) As DUP_COUNT
>     FROM TABLE_TO_DEDUP
>     GROUP BY DEDUP_KEY
> )
> WHERE DUP_COUNT > 1
> {code}
> If all of the following are true, then we can detect duplicates on the region 
> server in our coprocessors instead of returning every unique DEDUP_KEY to the 
> client for a final merge:
> - each scan won't be split on the same DEDUP_KEY
> - the DEDUP_KEY is the leading primary key column
> - we can push the DUP_COUNT > 1 evaluation through our coprocessor
> The first requirement is the hardest, but potentially a custom split policy 
> could be added.
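The push-down described in the issue can be sketched as follows. This is a hypothetical illustration, not Phoenix coprocessor code (`region_dup_count` and `client_merge` are invented names): because DEDUP_KEY is the leading primary-key column, rows within a region scan arrive sorted by it, so each region can compute a partial SUM(DUP_COUNT) in one pass, and the client merges one number per region instead of one row per unique key.

```python
from itertools import groupby

def region_dup_count(sorted_keys):
    """Partial aggregate computed inside one region scan.

    Rows are assumed sorted by DEDUP_KEY and no key spans a region
    split, so consecutive grouping is exact. Returns the region's
    contribution to SUM(DUP_COUNT) over keys with DUP_COUNT > 1.
    """
    total = 0
    for _, group in groupby(sorted_keys):
        n = sum(1 for _ in group)
        if n > 1:  # the pushed-down DUP_COUNT > 1 filter
            total += n
    return total

def client_merge(partials):
    """Client-side final merge: just sum the per-region partials."""
    return sum(partials)
```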



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
