[
https://issues.apache.org/jira/browse/PHOENIX-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155770#comment-15155770
]
James Taylor commented on PHOENIX-2700:
---------------------------------------
I think if sliding windows are implemented correctly, they could solve this
issue. Any sliding window implementation would need to handle the boundary case
correctly (i.e., windows that straddle parallelized scan and region
boundaries). I'm thinking something along the lines of returning the window
state to the client when a boundary is encountered (with the prior aggregated
state - the count in this case - returned as well). The final merge done by the
client would then need to take this boundary case into account. I suppose we'd
need both the initial window state and the final window state to handle the
general boundary-straddling case.
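To make that merge concrete, here's a minimal sketch of the client-side
stitching. The names are hypothetical (this is not Phoenix's API), and it
assumes each parallel scan reports the partial count for its first and last
DEDUP_KEY group (either of which may straddle a boundary) along with the
pre-aggregated sum over its interior groups:

{code}
import java.util.Arrays;
import java.util.List;

// Hypothetical per-scan window state; not a Phoenix class. firstCount/lastCount
// are the partial counts for the first and last group touched by the scan;
// interiorDupSum is SUM(count) over fully-contained groups with count > 1.
class ScanWindowState {
    final byte[] firstKey;
    final long firstCount;
    final byte[] lastKey;
    final long lastCount;      // equals firstCount when the scan holds one group
    final long interiorDupSum;

    ScanWindowState(byte[] firstKey, long firstCount,
                    byte[] lastKey, long lastCount, long interiorDupSum) {
        this.firstKey = firstKey;
        this.firstCount = firstCount;
        this.lastKey = lastKey;
        this.lastCount = lastCount;
        this.interiorDupSum = interiorDupSum;
    }
}

final class BoundaryMerge {
    // Final client-side merge over non-empty scans ordered by row key. Partial
    // counts for a group straddling a boundary are summed before the
    // DUP_COUNT > 1 filter is applied.
    static long sumDupCounts(List<ScanWindowState> scansInRowKeyOrder) {
        long total = 0;
        byte[] pendingKey = null; // boundary group carried from the prior scan
        long pendingCount = 0;
        for (ScanWindowState s : scansInRowKeyOrder) {
            total += s.interiorDupSum;
            if (pendingKey != null && Arrays.equals(pendingKey, s.firstKey)) {
                pendingCount += s.firstCount; // group straddles the boundary
            } else {
                if (pendingCount > 1) total += pendingCount; // flush prior group
                pendingKey = s.firstKey;
                pendingCount = s.firstCount;
            }
            if (!Arrays.equals(s.firstKey, s.lastKey)) {
                // The first group is complete; flush it and carry the last one.
                if (pendingCount > 1) total += pendingCount;
                pendingKey = s.lastKey;
                pendingCount = s.lastCount;
            }
            // If firstKey == lastKey, the scan held a single group: keep carrying.
        }
        if (pendingCount > 1) total += pendingCount;
        return total;
    }
}
{code}

The key point is that a straddling group's partial counts are combined across
adjacent scans before the DUP_COUNT > 1 filter runs, so such a group is neither
dropped nor double counted.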
> Push down count(group by key) queries
> -------------------------------------
>
> Key: PHOENIX-2700
> URL: https://issues.apache.org/jira/browse/PHOENIX-2700
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
> Queries that attempt to detect duplicates potentially return a lot of data to
> the client if the column being deduped is near-unique. For example:
> {code}
> SELECT SUM(DUP_COUNT)
> FROM (
>     SELECT DEDUP_KEY, COUNT(1) AS DUP_COUNT
>     FROM TABLE_TO_DEDUP
>     GROUP BY DEDUP_KEY
> ) AS DUPS
> WHERE DUP_COUNT > 1
> {code}
> If all of the following are true, then we can detect duplicates on the region
> server in our coprocessors instead of returning every unique DEDUP_KEY to the
> client for a final merge:
> - no scan boundary falls within a run of rows sharing the same DEDUP_KEY
> - the DEDUP_KEY is the leading primary key column
> - we can push the DUP_COUNT > 1 evaluation through our coprocessor
> The first requirement is the hardest, but potentially a custom split policy
> could be added (see the sketch below).
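> One possibility: HBase's KeyPrefixRegionSplitPolicy truncates region split
> points to a fixed-length row key prefix, so all rows sharing that prefix stay
> in one region. If DEDUP_KEY were a fixed-width leading column, that would
> satisfy the first requirement. A hedged sketch (the table name and prefix
> length are illustrative, not a committed design):
> {code}
> import org.apache.hadoop.hbase.HTableDescriptor;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy;
>
> class DedupSplitPolicyExample {
>     // Keep HBase from splitting a region mid-group by truncating every split
>     // point to the row key prefix that holds DEDUP_KEY.
>     static HTableDescriptor withPrefixSplitPolicy() {
>         HTableDescriptor desc =
>             new HTableDescriptor(TableName.valueOf("TABLE_TO_DEDUP"));
>         desc.setValue(HTableDescriptor.SPLIT_POLICY,
>             KeyPrefixRegionSplitPolicy.class.getName());
>         // Assumption: DEDUP_KEY occupies the first 8 bytes of the row key.
>         desc.setValue(KeyPrefixRegionSplitPolicy.PREFIX_LENGTH_KEY, "8");
>         return desc;
>     }
> }
> {code}
> Note this only constrains region boundaries; Phoenix's parallel scan
> boundaries within a region would still need the same alignment.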