Hi,
I'm trying out the new cardinality aggregation and want to measure its 
accuracy on my data. I'm using a dataset of one day of sample tweets (2.8 
million tweets).

I'm counting the number of unique usernames per language.
To get my "reference" unique count I use this:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggs": {
        "unique_count": {
          "value_count": {
            "field": "screen_name"
          }
        }
      }
    }
  }
}

Result:
  "aggregations": {
      "country_count": {
         "buckets": [
            {
               "key": "en",
               "doc_count": 872906,
               "unique_count": {
                  "value": 307489
               }
            },
            {
               "key": "ja",
               "doc_count": 581521,
               "unique_count": {
                  "value": 103035
               }
            },


To get the approximate count with cardinality:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggregations": {
        "distinct_users_approx": {
          "cardinality": {
            "field": "screen_name",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}

Result:
   "aggregations": {
      "country_count": {
         "buckets": [
            {
               "key": "en",
               "doc_count": 872906,
               "distinct_users_approx": {
                  "value": 145541
               }
            },
            {
               "key": "ja",
               "doc_count": 581521,
               "distinct_users_approx": {
                  "value": 50824
               }
            },

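So that's 307489 (reference) vs 145541 (cardinality) for English, and 103035 
vs 50824 for Japanese. The approximate counts are roughly half the reference 
counts, which is not very accurate.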
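
To put a number on the gap, here is a minimal sketch that computes the relative error of the cardinality result against the reference count for the two buckets shown above. The function name `relative_error` is just an illustrative helper, not part of any Elasticsearch client, and it treats the reference count as ground truth, which is exactly what question 1 below is asking about.

```python
def relative_error(reference: int, approx: int) -> float:
    """Absolute difference between the two counts, as a fraction of the reference."""
    return abs(approx - reference) / reference


# (reference count, cardinality count) per language, copied from the results above.
counts = {
    "en": (307489, 145541),
    "ja": (103035, 50824),
}

errors = {
    lang: relative_error(reference, approx)
    for lang, (reference, approx) in counts.items()
}

for lang, err in errors.items():
    print(f"{lang}: {err:.1%} off the reference count")
```

With the numbers above this comes out at about 52.7% for English and 50.7% for Japanese.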

1) Am I computing the reference distinct count correctly?
2) Is it supposed to be this inaccurate on this type of dataset?
3) Is there any way to improve precision?
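
For question 1, one way to cross-check the reference numbers independently of any aggregation would be to count distinct usernames per language on the client side. This is only a sketch, assuming the tweets can be exported as (lang, screen_name) pairs and that the set of usernames fits in memory; `distinct_users_per_lang` and the sample data are hypothetical.

```python
from collections import defaultdict


def distinct_users_per_lang(pairs):
    """Return the exact number of distinct screen names seen for each language."""
    seen = defaultdict(set)
    for lang, screen_name in pairs:
        seen[lang].add(screen_name)  # a set keeps each username once
    return {lang: len(names) for lang, names in seen.items()}


# Hypothetical sample; real input would be one pair per tweet in the index.
sample = [
    ("en", "alice"),
    ("en", "bob"),
    ("en", "alice"),
    ("ja", "carol"),
]

exact = distinct_users_per_lang(sample)
print(exact)
```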

-
Henrik
