[
https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205902#comment-15205902
]
Nick Vatamaniuc commented on COUCHDB-2971:
------------------------------------------
Ah, good point on having a nicer way to specify precision. Yeah otherwise it
looks kind of hackish.
Noticed they provide various backends for the registers. One is a C NIF. Tried
to compile and run their code on Erlang 18 and had to fiddle with it a bit, but
got it to work and got these results:
https://gist.github.com/nickva/bf19a2b7b537f5051a99
There are some tradeoffs between memory usage, cardinality and union times.
While C array is interesting, having the cheapest union operation (under 1ms),
has cardinality estimation time greater than a few milliseconds which might not
play well with the Erlang schedulers. But if it happens only during the
finalize stage it could be handled in another way (some thread + queue
mechanism). Unfortunately it also has a large/constant memory usage for low
cardinalities.
> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
> Key: COUCHDB-2971
> URL: https://issues.apache.org/jira/browse/COUCHDB-2971
> Project: CouchDB
> Issue Type: Improvement
> Reporter: Adam Kocoloski
>
> We’ve seen a number of applications now where a user needs to count the
> number of unique keys in a view. Currently the recommended approach is to add
> a trivial reduce function and then count the number of rows in a _list
> function or client-side application code, but of course that doesn’t scale
> nicely.
> It seems that in a majority of these cases all that’s required is an
> approximation of the number of distinct entries, which brings us into the
> space of hash sets, linear probabilistic counters, and the ever-popular
> “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a
> nice candidate for a builtin reduce. The size of the data structure is
> independent of the number of input elements and individual HLL filters can be
> unioned together. There’s already what seems to be a good MIT-licensed
> implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions;
> it’d only give the correct result for the exact key. I don’t think that
> should preclude us from evaluating it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)