[
https://issues.apache.org/jira/browse/PHOENIX-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452951#comment-15452951
]
Lars Hofhansl commented on PHOENIX-3225:
----------------------------------------
A colleague of mine points out that we can also approximate the number of
distinct values with HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog.
That seems easy enough to do in principle.
We would need to invent a special UDF or syntax for it, perhaps
COUNT_DISTINCT_APPROX or something similar.
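
To make the idea concrete, here is a rough sketch of a HyperLogLog estimator in Python. It is not Phoenix code and not a proposed implementation: the class name, the SHA-1 based hash, and the default of 2**12 registers are assumptions picked for illustration. The merge step reflects the general property that sketches built independently (for example, one per region) can be combined by taking the register-wise maximum, which is what would make this attractive for a distributed DISTINCT count.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Partial sketches combine by taking the register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4096 registers the sketch stays at a fixed size of a few kilobytes regardless of how many values are added, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. That is the trade-off versus an exact DISTINCT, which has to track every unique value.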
> Distinct Queries are slower than expected at scale.
> ---------------------------------------------------
>
> Key: PHOENIX-3225
> URL: https://issues.apache.org/jira/browse/PHOENIX-3225
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Lars Hofhansl
>
> In our large-scale tests we found that we can easily sort 400G on a few hundred
> machines, but that a simple DISTINCT would just time out. Perhaps that's
> expected, as we have to keep track of the unique values, but we should
> investigate.