[
https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892683#comment-15892683
]
Nick Vatamaniuc commented on COUCHDB-2971:
------------------------------------------
Gave it a try. Attached my rebar.config.script (the couchdb-2971 branch in the
top repo was a bit outdated).
Filled a db with 10k documents:
{code}
In [5]: db = s.create('db')
In [6]: db['_design/dd1'] = {"views":{"v1":{"map":"function(doc){emit(doc._id, null) };", "reduce":"_distinct"}}}
In [7]: for i in xrange(10000): db[str(i)] = {}
{code}
{code}
http -b 'http://adm:pass@localhost:15984/db/_design/dd1/_view/v1'
{
"rows": [
{
"key": null,
"value": 9737.75978064411
}
]
}
{code}
About 2.6% difference as expected.
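A quick check of that figure, as a sketch. The 1.04/sqrt(m) term is the usual HyperLogLog standard error for m registers. The register counts tried below are assumptions, since the precision the reducer was configured with is not stated here. This is Python 3, unlike the session above.

```python
import math

def relative_error(true_count, estimate):
    """Relative difference between the exact count and the estimate."""
    return abs(true_count - estimate) / true_count

def hll_standard_error(m):
    """Usual HyperLogLog standard error for m registers."""
    return 1.04 / math.sqrt(m)

# 10000 documents were inserted; the view returned 9737.75978064411.
observed = relative_error(10000, 9737.75978064411)
print("observed error: {:.2%}".format(observed))

# Hypothetical precisions (p bits of the hash select one of m = 2**p registers).
for p in (10, 11, 12):
    m = 2 ** p
    print("p={} m={} standard error ~{:.2%}".format(p, m, hll_standard_error(m)))
```

An observed error of the same order as the standard error is what a single run would be expected to show.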
Was curious what group=true output looked like:
{code}
http -b 'http://adm:pass@localhost:15984/db/_design/dd1/_view/v1?limit=2&group=true'
{
"rows": [
{
"key": "0",
"value": 1.0002442201269182
},
{
"key": "1",
"value": 1.0002442201269182
}
]
}
{code}
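The per-key values come out slightly above 1 rather than exactly 1. One possible explanation, inferred from the number itself and not checked against the hyper source: HyperLogLog implementations commonly fall back to linear counting, m * ln(m / V), when most registers are still empty (V is the number of empty registers). With a single key, V = m - 1. The sketch below (Python 3) tries a few register counts to see which one comes closest to the value above. The range of precisions is an assumption.

```python
import math

def linear_counting(m, empty):
    """Linear-counting estimate for m registers, of which `empty` are zero."""
    return m * math.log(m / empty)

observed = 1.0002442201269182  # value returned for each key with group=true

# Hypothetical precisions: p hash bits select one of m = 2**p registers.
candidates = range(8, 17)
for p in candidates:
    m = 2 ** p
    est = linear_counting(m, m - 1)  # exactly one register set
    print("p={:2d} m={:6d} estimate={:.16f} diff={:.2e}".format(
        p, m, est, abs(est - observed)))

best_p = min(candidates,
             key=lambda p: abs(linear_counting(2 ** p, 2 ** p - 1) - observed))
print("closest match: p={} (m={})".format(best_p, 2 ** best_p))
```

If this reading is right, the small excess over 1 is a property of the estimator and not noise, which is part of why the output can surprise someone expecting an exact count.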
I like {{_distinct}} but wonder whether there'd be confusion about why it does
not return exact results (as _count does). It is a bit of a leaky abstraction
since it exposes the precision trade-off to the user directly. Maybe
{{_distinct_approx}}, {{_distinct_hll}}, or {{_sketch_distinct}} (as in it's a
sketch algorithm, in case we add count-min later)?
The options API looks good. Not sure how hard it would be to implement. Would
there be other uses for it? Maybe _stats or user views could find options
useful as well. But a simple version with a default precision might be a good
start, to see how people use it first.
Noticed the hyper code has some Erlang 18-only functions, but it looks like
they are in the perf report code only, so no problem there.
As Robert suggested, to be consistent we might need to have a couchdb-hyper
version. I saw we do that for meck and other external projects.
> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
> Key: COUCHDB-2971
> URL: https://issues.apache.org/jira/browse/COUCHDB-2971
> Project: CouchDB
> Issue Type: Improvement
> Reporter: Adam Kocoloski
> Attachments: rebar.config.script
>
>
> We’ve seen a number of applications now where a user needs to count the
> number of unique keys in a view. Currently the recommended approach is to add
> a trivial reduce function and then count the number of rows in a _list
> function or client-side application code, but of course that doesn’t scale
> nicely.
> It seems that in a majority of these cases all that’s required is an
> approximation of the number of distinct entries, which brings us into the
> space of hash sets, linear probabilistic counters, and the ever-popular
> “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a
> nice candidate for a builtin reduce. The size of the data structure is
> independent of the number of input elements and individual HLL filters can be
> unioned together. There’s already what seems to be a good MIT-licensed
> implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions;
> it’d only give the correct result for the exact key. I don’t think that
> should preclude us from evaluating it.
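
The two properties named in the description (a structure whose size does not depend on the number of inputs, and filters that can be unioned) can be sketched with a toy HyperLogLog in Python 3. This is not the hyper library's code. The class name, the SHA-1 based hash and the default p=11 are illustrative assumptions.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: 2**p registers, each holding a max 'rank'."""

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m  # fixed size, whatever the input count

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (assumed choice).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union of two filters: element-wise max of the registers.
        if self.p != other.p:
            raise ValueError("precisions differ")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range linear counting
        return raw

# Two filters over overlapping key ranges, plus one over the whole range.
a, b, u = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(6000):
    a.add(str(i))
    u.add(str(i))
for i in range(4000, 10000):
    b.add(str(i))
    u.add(str(i))

merged = a.merge(b)
print(len(merged.registers), merged.estimate(), u.estimate())
```

Merging by element-wise max gives the same registers as feeding every key into one filter, which is the property a reduce/rereduce pair would rely on.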