[jira] [Commented] (PHOENIX-3225) Distinct Queries are slower than expected at scale.

Lars Hofhansl (JIRA) Wed, 31 Aug 2016 11:20:57 -0700

    [ https://issues.apache.org/jira/browse/PHOENIX-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452951#comment-15452951 ]


Lars Hofhansl commented on PHOENIX-3225:
----------------------------------------

A colleague of mine points out that we can also approximate the number of 
distinct values with HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog. 
That seems easy enough to do in principle.

We would need to invent a special UDF or syntax for that, perhaps 
COUNT_DISTINCT_APPROX or something similar.
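
To make the idea concrete, here is a rough Python sketch of a HyperLogLog 
counter. It is only an illustration and not Phoenix code: the class name, the 
parameter p (number of index bits) and the use of SHA-1 as the hash are all 
assumptions made for the example. The merge step is the interesting part at 
scale: assuming each server builds a partial sketch over its own rows, the 
partial sketches could be combined with an element-wise max instead of 
shipping every unique value around.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative only, names are hypothetical)."""

    def __init__(self, p=12):
        self.p = p                       # number of bits used as register index
        self.m = 1 << p                  # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The union of two sketches is the element-wise max of their registers.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    # Two partial sketches, as if built on two different servers.
    left, right = HyperLogLog(), HyperLogLog()
    for i in range(60000):
        left.add(i)
    for i in range(40000, 100000):
        right.add(i)
    left.merge(right)
    print("approximate distinct count:", round(left.count()))
```

With p=12 the sketch keeps 4096 small registers regardless of how many rows 
are scanned, and the expected relative error is roughly 1.04/sqrt(4096), i.e. 
about 1.6%. A larger p trades memory for accuracy.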


> Distinct Queries are slower than expected at scale.
> ---------------------------------------------------
>
>                 Key: PHOENIX-3225
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3225
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Lars Hofhansl
>
> In our large-scale tests we found that we can easily sort 400G on a few 
> hundred machines, but that a simple DISTINCT would just time out. Perhaps 
> that's expected, as we have to keep track of the unique values, but we 
> should investigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
