Hi Adrien, I have two comments/questions:
1) For me, the documentation is still somewhat confusing, and the difference between the *cardinality* and *value_count* aggregations is not 100% clear.

2) When it comes to counting unique values: I believe that the only option, at the moment, is the *cardinality* aggregation. This, however, comes at the price of an approximate result (as discussed in the documentation and in the paper describing HyperLogLog++). I understand the need for an approximating approach, but I think the returned result should indicate a bound on the error. Otherwise, the returned count could be considered useless. The documentation mentions a figure of 5%: is it independent of the cardinality? What happens to this bound when the precision threshold is >> 40,000?

Thanks for your time,
Dror

On Tuesday, April 1, 2014 9:50:30 AM UTC+2, Adrien Grand wrote:
>
> Hi Henrik,
>
> Indeed, there is no way to compute exact unique counts. The reason why we
> don't expose such a feature is that it would be very costly. In your case,
> the cardinality is not too large, so the terms aggregation helped compute
> the number of unique values. But if the actual cardinality had been very
> large (e.g. 100M), it is very likely that using the terms agg this way
> would have required a lot of memory (maybe triggering out-of-memory
> errors on your nodes), been very slow, and caused a lot of network traffic.
> We will try to clarify this through documentation or a blog post soon.
>
> Thanks for trying out this new aggregation!
>
> On Mon, Mar 31, 2014 at 11:09 PM, Henrik Nordvik <[email protected]> wrote:
>
>> Ah, so there is currently no easy way of getting exact unique counts out
>> of elasticsearch?
>>
>> I found a manual way of doing it:
>>
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
>> "facets": { "a": { "terms": { "field": "screen_name", "size":
>> 200000},"facet_filter": {"query": {"term": {"lang": "en"}}}}},"size": 0}' |
>> ./jq '.facets.a.terms | length'
>> 145474 (vs 145541)
>>
>> curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
>> "facets": { "a": { "terms": { "field": "screen_name", "size":
>> 200000},"facet_filter": {"query": {"term": {"lang": "ja"}}}}},"size": 0}' |
>> ./jq '.facets.a.terms | length'
>> 50949 (vs 50824)
>>
>> So the count is quite close! Thank you.
>>
>> On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:
>>>
>>> value_count is the total number of values extracted per bucket. This
>>> example might help:
>>>
>>> https://gist.github.com/bly2k/9843335
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/8669e9f0-eece-4b77-8e99-fec483359e2f%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> Adrien Grand
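
To make the value_count vs. cardinality distinction discussed in this thread concrete, here is a minimal pure-Python sketch. It does not call Elasticsearch and is not how either aggregation is implemented; the helper names (`value_count`, `exact_cardinality`, `relative_error`) and the sample `screen_names` list are made up for illustration. The second part compares the two pairs of counts Henrik posted, under the assumption that the first number in each pair is the exact terms-facet count and the "vs" number is the approximate cardinality result.

```python
def value_count(values):
    """Total number of values, duplicates included (what value_count reports)."""
    return len(values)


def exact_cardinality(values):
    """Exact number of distinct values (what cardinality approximates)."""
    return len(set(values))


def relative_error(exact, approx):
    """Relative difference between an approximate count and the exact one."""
    return abs(approx - exact) / exact


# Hypothetical sample data: six values, three distinct screen names.
screen_names = ["alice", "bob", "alice", "carol", "bob", "alice"]
print("value_count:", value_count(screen_names))
print("distinct   :", exact_cardinality(screen_names))

# Counts quoted in the thread, read as (exact, approximate) by assumption.
thread_counts = {"en": (145474, 145541), "ja": (50949, 50824)}
for lang, (exact, approx) in thread_counts.items():
    print(f"{lang}: relative error {relative_error(exact, approx):.4%}")
```

Under that reading, both differences in the thread come out well below 1%, far inside the 5% figure mentioned in the documentation. Two data points do not establish a general bound, though.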
