Re: Accuracy on cardinality aggregate

Henrik Nordvik Fri, 28 Mar 2014 13:39:13 -0700

I compared the unique_count value with the total field of the old terms facet,
and it matched. What else would that count be? It is lower than the doc count.
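
To make the distinction under discussion concrete, here is a minimal Python sketch that computes, per language, the document count, the number of screen_name values, and the number of distinct screen_name values. The sample documents and the helper name count_per_lang are made up for illustration and are not from the tweet dataset; the sketch only shows how the three numbers can differ, not how Elasticsearch computes them internally.

```python
from collections import defaultdict

# Hypothetical sample documents; NOT taken from the tweet dataset.
docs = [
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "bob"},
    {"lang": "ja", "screen_name": "carol"},
    {"lang": "ja", "screen_name": "carol"},
    {"lang": "en"},  # a document with no screen_name value
]


def count_per_lang(docs):
    """Return {lang: (doc_count, value_count, distinct_count)}."""
    doc_count = defaultdict(int)
    values = defaultdict(list)
    for doc in docs:
        lang = doc["lang"]
        doc_count[lang] += 1
        # Only documents that actually carry the field contribute a value.
        if "screen_name" in doc:
            values[lang].append(doc["screen_name"])
    return {
        lang: (doc_count[lang], len(values[lang]), len(set(values[lang])))
        for lang in doc_count
    }


stats = count_per_lang(docs)
for lang, (n_docs, n_values, n_distinct) in sorted(stats.items()):
    print(lang, n_docs, n_values, n_distinct)
```

An exact distinct count like len(set(...)) above is the kind of reference number an approximate cardinality result would need to be compared against.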
On 28 Mar 2014 18:54, "Mark Harwood" <[email protected]> wrote:


> I don't believe value_count is intended to be a unique count.
>
>
>
> On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:
>>
>> Hi,
>> I'm trying out the new cardinality aggregation, and want to measure the
>> accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m
>> tweets).
>>
>> I'm counting the number of unique usernames per language.
>> To get my "reference" unique count I use this:
>> GET /twitter-2014.03.26/_search
>> {
>>   "size": 0,
>>   "aggs": {
>>     "country_count": {
>>       "terms": {
>>         "field": "lang"
>>       },
>>       "aggs": {
>>         "unique_count": { "value_count": { "field": "screen_name" } }
>>       }
>>     }
>>   }
>> }
>>
>> Result:
>>   "aggregations": {
>>       "country_count": {
>>          "buckets": [
>>             {
>>                "key": "en",
>>                "doc_count": 872906,
>>                "unique_count": {
>>                   "value": 307489
>>                }
>>             },
>>             {
>>                "key": "ja",
>>                "doc_count": 581521,
>>                "unique_count": {
>>                   "value": 103035
>>                }
>>             },
>>
>>
>> To get the approximate count with cardinality:
>> GET /twitter-2014.03.26/_search
>> {
>>   "size": 0,
>>   "aggs": {
>>     "country_count": {
>>       "terms": {
>>         "field": "lang"
>>       },
>>       "aggregations": {
>>         "distinct_users_approx": {
>>           "cardinality": {
>>             "field": "screen_name",
>>             "precision_threshold": 40000
>>           }
>>         }
>>       }
>>     }
>>   }
>> }
>>
>> Result:
>>    "aggregations": {
>>       "country_count": {
>>          "buckets": [
>>             {
>>                "key": "en",
>>                "doc_count": 872906,
>>                "distinct_users_approx": {
>>                   "value": 145541
>>                }
>>             },
>>             {
>>                "key": "ja",
>>                "doc_count": 581521,
>>                "distinct_users_approx": {
>>                   "value": 50824
>>                }
>>             },
>>
>> So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. Not
>> very accurate.
>>
>> 1) Am I computing the reference distinct count correctly?
>> 2) Is it supposed to be this inaccurate on this type of dataset?
>> 3) Is there any way to improve precision?
>>
>> -
>> Henrik
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAH3vNzN9ftYTJEnAo3si1GKJk0e2qc%2BRoApXmXB2CB_6bT%3Dysw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
