[jira] [Commented] (COUCHDB-2971) Provide cardinality estimate (COUNT DISTINCT) as builtin reducer

Adam Kocoloski (JIRA) Mon, 21 Mar 2016 17:11:58 -0700

    [ 
https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205408#comment-15205408
 ]


Adam Kocoloski commented on COUCHDB-2971:
-----------------------------------------

Exactly! I took a shot at implementing this yesterday evening and got the 
basics working. The crux of the implementation is exactly as you describe:

{code}
count_distinct_keys(reduce, KVs) ->
    lists:foldl(fun([[Key, _Id], _Value], Filter) ->
        hyper:insert(term_to_binary(Key), Filter)
    end, hyper:new(Precision), KVs);
count_distinct_keys(rereduce, Reds) ->
    hyper:union([Filter || [_, Filter] <- Reds]).
{code}

You're right that we need a "finalize" operation to estimate the cardinality 
given the final HLL object. The couch_query_servers API doesn't make this easy 
- the `rereduce` function is used internally by the btree reducer as well as 
via the public `fold_reduce` function. I ended up adding a `finalize/1` 
function and calling it in the two places that mattered. I'll try to post my 
WIP in the next 24 hours.

The unions are a fairly expensive operation and so we need to be extra careful 
to keep the tree wide and shallow. There's an interesting interplay here 
between the precision of the filter and the chunking threshold for the tree; 
this is one case where the default chunkify function is almost certainly not 
optimal.

It'll be more code, but I thought it might be nicer to allow the precision of 
the filter to be specified as an option in the view specification in the design 
doc instead of creating 12 "different" builtins.

> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
>                 Key: COUCHDB-2971
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2971
>             Project: CouchDB
>          Issue Type: Improvement
>            Reporter: Adam Kocoloski
>
> We’ve seen a number of applications now where a user needs to count the 
> number of unique keys in a view. Currently the recommended approach is to add 
> a trivial reduce function and then count the number of rows in a _list 
> function or client-side application code, but of course that doesn’t scale 
> nicely.
> It seems that in a majority of these cases all that’s required is an 
> approximation of the number of distinct entries, which brings us into the 
> space of hash sets, linear probabilistic counters, and the ever-popular 
> “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a 
> nice candidate for a builtin reduce. The size of the data structure is 
> independent of the number of input elements and individual HLL filters can be 
> unioned together. There’s already what seems to be a good MIT-licensed 
> implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions; 
> it’d only give the correct result for the exact key. I don’t think that 
> should preclude us from evaluating it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (COUCHDB-2971) Provide cardinality estimate (COUNT DISTINCT) as builtin reducer

Reply via email to