kocolosk opened a new pull request #1346: Add _approx_count_distinct as a 
builtin reduce function
URL: https://github.com/apache/couchdb/pull/1346
 
 
   ## Overview
   
   We’ve seen a number of applications now where a user needs to count the 
number of unique keys in a view. Currently the recommended approach is to add a 
trivial reduce function and then count the number of rows in a _list function 
or client-side application code, but of course that doesn’t scale nicely.
   
   It seems that in a majority of these cases all that’s required is an 
approximation of the number of distinct entries, which brings us into the space 
of hash sets, linear probabilistic counters, and the ever-popular “HyperLogLog” 
algorithm. Taking HLL specifically, this seems like quite a nice candidate for 
a builtin reduce. The size of the data structure is independent of the number 
of input elements and individual HLL filters can be unioned together. There’s 
already what seems to be a good MIT-licensed implementation on GitHub:
   
   https://github.com/GameAnalytics/hyper
   
   One caveat is that this reducer would not work for group_level reductions; 
it’d only give the correct result for the exact key. I don’t think that should 
preclude us from evaluating it.
   
   This PR adds a new `_approx_count_distinct` function which uses the binary 
HLL implementation above. It has the precision fixed to 11 which yields a 
relative error of 2%. A future improvement would make this precision 
configurable in the design document.
   
   See https://issues.apache.org/jira/browse/COUCHDB-2971 for history.
   
   ## Testing recommendations
   
   1. Generate a view with a known number of distinct keys.
   2. Add `_approx_count_distinct` as a reduce function.
   3. Compare the reduce output with the exact number of distinct keys. The 
result should be within 2%.
   
   ## Related Issues or Pull Requests
   
   See https://issues.apache.org/jira/browse/COUCHDB-2971 for history.
   
   ## Checklist
   
   - [x] Code is written and works correctly;
   - [ ] Changes are covered by tests;
   - [x] Documentation reflects the changes (separate PR forthcoming).
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to