[ 
https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890583#comment-15890583
 ] 

Adam Kocoloski commented on COUCHDB-2971:
-----------------------------------------

OK, at long last I've posted an initial pass at the implementation. Some 
noteworthy bits:

- The precision is currently fixed at 11, so m = 2^11 = 2048. This uses ~1.5 KB 
per filter and has a relative error of about 2%.
- I simply called this reduce function _distinct, but I'm not sure that's the 
best name. Open to suggestions
- I needed to introduce a {{finalize}} function in order to implement the 
reduce. We could use {{finalize}} to make some of the other builtins more 
efficient too by avoiding repeated conversions back and forth from EJSON. At 
some point in the future we could also think about directly exposing the 
finalize capability to custom aggregations in e.g. Mango.
- If we do allow customized precisions, what should the API look like for that? 
E.g. do we need an "options" object as part of the view?

{code}
"views": {
  "cardinality": {
    "map": "...",
    "reduce": "_distinct",
    "options": {
      "precision": 12
    }
  }
}
{code}
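For context on the numbers in the first bullet: with precision p an HLL sketch keeps m = 2^p registers, the standard error is roughly 1.04 / sqrt(m) (about 2.3% for m = 2048), and at 6 bits per register the storage works out to 2048 * 6 / 8 = 1536 bytes, i.e. ~1.5 KB. A rough Python sketch of the moving parts (my own illustration of the raw HLL estimator, not the Erlang code on the branch; class and method names here are made up) shows why the structure suits a builtin reduce: merging in the rereduce step is just an element-wise max, and only {{finalize}} turns registers into a number:

{code}
import hashlib

class HLL:
    """Minimal HyperLogLog sketch (raw estimator only, no small- or
    large-range corrections) -- for illustration, not production use."""

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p            # m = 2^p registers; p=11 -> m=2048
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the key
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64-p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        """Merging two filters is an element-wise max over registers --
        this is what makes HLL a natural fit for a rereduce."""
        out = HLL(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

    def finalize(self):
        """Only at finalize time do the registers become an estimate."""
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z
{code}

Intermediate reduce values stay as fixed-size register arrays regardless of how many keys were hashed in, which is why the on-disk reduce value stays at ~1.5 KB per node.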

> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
>                 Key: COUCHDB-2971
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2971
>             Project: CouchDB
>          Issue Type: Improvement
>            Reporter: Adam Kocoloski
>
> We’ve seen a number of applications now where a user needs to count the 
> number of unique keys in a view. Currently the recommended approach is to add 
> a trivial reduce function and then count the number of rows in a _list 
> function or client-side application code, but of course that doesn’t scale 
> nicely.
> It seems that in a majority of these cases all that’s required is an 
> approximation of the number of distinct entries, which brings us into the 
> space of hash sets, linear probabilistic counters, and the ever-popular 
> “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a 
> nice candidate for a builtin reduce. The size of the data structure is 
> independent of the number of input elements and individual HLL filters can be 
> unioned together. There’s already what seems to be a good MIT-licensed 
> implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions; 
> it’d only give the correct result for the exact key. I don’t think that 
> should preclude us from evaluating it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
