[ https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Kocoloski closed COUCHDB-2971. ----------------------------------- > Provide cardinality estimate (COUNT DISTINCT) as builtin reducer > ---------------------------------------------------------------- > > Key: COUCHDB-2971 > URL: https://issues.apache.org/jira/browse/COUCHDB-2971 > Project: CouchDB > Issue Type: Improvement > Reporter: Adam Kocoloski > Priority: Major > Fix For: 2.2 > > Attachments: rebar.config.script > > > We’ve seen a number of applications now where a user needs to count the > number of unique keys in a view. Currently the recommended approach is to add > a trivial reduce function and then count the number of rows in a _list > function or client-side application code, but of course that doesn’t scale > nicely. > It seems that in a majority of these cases all that’s required is an > approximation of the number of distinct entries, which brings us into the > space of hash sets, linear probabilistic counters, and the ever-popular > “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a > nice candidate for a builtin reduce. The size of the data structure is > independent of the number of input elements and individual HLL filters can be > unioned together. There’s already what seems to be a good MIT-licensed > implementation on GitHub: > https://github.com/GameAnalytics/hyper > One caveat is that this reducer would not work for group_level reductions; > it’d only give the correct result for the exact key. I don’t think that > should preclude us from evaluating it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)