[ 
https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890518#comment-15890518
 ] 

ASF subversion and git services commented on COUCHDB-2971:
----------------------------------------------------------

Commit ee32cd5825aaf63448651c9521f0927083d2281e in couchdb-couch's branch 
refs/heads/2971-count-distinct from [~kocolosk]
[ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch.git;h=ee32cd5 ]

Add a cardinality estimator builtin reduce

This introduces a _distinct builtin reduce function, which uses a
HyperLogLog algorithm to estimate the number of distinct keys in the
view index. The precision is currently fixed to 2^11 observables and
therefore uses approximately 1.5 KB of memory.

COUCHDB-2971


> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
>                 Key: COUCHDB-2971
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2971
>             Project: CouchDB
>          Issue Type: Improvement
>            Reporter: Adam Kocoloski
>
> We’ve seen a number of applications now where a user needs to count the 
> number of unique keys in a view. Currently the recommended approach is to add 
> a trivial reduce function and then count the number of rows in a _list 
> function or client-side application code, but of course that doesn’t scale 
> nicely.
> It seems that in a majority of these cases all that’s required is an 
> approximation of the number of distinct entries, which brings us into the 
> space of hash sets, linear probabilistic counters, and the ever-popular 
> “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a 
> nice candidate for a builtin reduce. The size of the data structure is 
> independent of the number of input elements and individual HLL filters can be 
> unioned together. There’s already what seems to be a good MIT-licensed 
> implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions; 
> it’d only give the correct result for the exact key. I don’t think that 
> should preclude us from evaluating it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to