Re: Accuracy on cardinality aggregate

Dror Atariah Tue, 25 Nov 2014 05:29:25 -0800

Hi Adrien,

I have two comments/questions:


1) For me, the documentation is still somewhat confusing, and the difference 
between the *cardinality* and *value_count* aggregations is not entirely clear.

2) When it comes to counting unique values: I believe that the only option 
at the moment is to use the *cardinality* aggregation. This, however, comes 
at the price of an approximate result (as discussed in the documentation and 
in the paper describing HyperLogLog++). I understand the need for an 
approximate approach, but I think that the returned result should indicate 
a bound on the error. Otherwise, the returned count could be considered 
useless. In the documentation the figure 5% is mentioned. Is it independent 
of the cardinality? What happens to this bound when the precision threshold 
is >> 40,000?
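
To make both points concrete, here is a minimal sketch in plain Python (not 
Elasticsearch code). It assumes, following Binh Ly's description quoted below, 
that value_count counts every extracted value, duplicates included, while 
cardinality counts distinct values. The screen names are made up for 
illustration. The two number pairs are the ones Henrik reports below; treating 
the terms-facet length as the exact count and the "vs" figure as the 
cardinality estimate is my reading of his message.

```python
def value_count(values):
    """Total number of values, duplicates included."""
    return len(values)


def exact_cardinality(values):
    """Number of distinct values, computed exactly with a set."""
    return len(set(values))


def relative_error(exact, approx):
    """Relative error of an approximate count against the exact count."""
    return abs(approx - exact) / exact


# Hypothetical sample data, for illustration only.
screen_names = ["alice", "bob", "alice", "carol", "bob", "alice"]
print(value_count(screen_names))        # 6 values in total
print(exact_cardinality(screen_names))  # 3 distinct values

# (exact, approximate) pairs taken from Henrik's message quoted below;
# which number is exact and which is the estimate is an assumption.
for exact, approx in [(145474, 145541), (50949, 50824)]:
    print(f"exact={exact} approx={approx} "
          f"relative error={relative_error(exact, approx):.4%}")
```

For these two samples the observed relative error is well below the 5% figure, 
but two data points say nothing about a guaranteed bound, which is what I am 
asking about.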

Thanks for your time,
Dror

On Tuesday, April 1, 2014 9:50:30 AM UTC+2, Adrien Grand wrote:
>
> Hi Henrik,
>
> Indeed, there is no way to compute exact unique counts. The reason why we 
> don't expose such a feature is that it would be very costly. In your case, 
> the cardinality is not too large so the terms aggregation helped compute 
> the number of unique values but if the actual cardinality had been very 
> large (e.g. 100M), it is very likely that trying to use the terms agg to do 
> so would have required a lot of memory (maybe triggering out-of-memory 
> errors on your nodes), been very slow and caused a lot of network traffic. 
> We will try to clarify this through documentation or a blog post soon.
>
> Thanks for trying out this new aggregation!
>
>
>
> On Mon, Mar 31, 2014 at 11:09 PM, Henrik Nordvik <[email protected]> wrote:
>
>> Ah, so there is currently no easy way of getting exact unique counts out 
>> of elasticsearch?
>>
>> I found a manual way of doing it:
>>
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{ 
>> "facets": { "a": {  "terms": { "field": "screen_name", "size": 
>> 200000},"facet_filter": {"query": {"term": {"lang": "en"}}}}},"size": 0}' | 
>> ./jq '.facets.a.terms | length'
>> 145474 (vs 145541)
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{ 
>> "facets": { "a": {  "terms": { "field": "screen_name", "size": 
>> 200000},"facet_filter": {"query": {"term": {"lang": "ja"}}}}},"size": 0}' | 
>> ./jq '.facets.a.terms | length'
>> 50949 (vs 50824)
>>
>> So the count is quite close! Thank you.
>>
>>
>>
>> On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:
>>>
>>> value_count is the total number of values extracted per bucket. This 
>>> example might help:
>>>
>>> https://gist.github.com/bly2k/9843335
>>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/8669e9f0-eece-4b77-8e99-fec483359e2f%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Adrien Grand
>  

