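I don't believe value_count is intended to be a unique count. As far as I
know it counts every value extracted from the matching documents, duplicates
included, so it isn't a valid reference to compare cardinality against.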
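
As a rough illustration of the difference between counting values and
counting distinct values, here is a minimal sketch in plain Python rather
than Elasticsearch. The screen names and both helper functions are made up
for the example.

```python
# Hypothetical data: one screen name per tweet, with repeat posters.
screen_names = ["alice", "bob", "alice", "carol", "bob", "alice"]


def value_count(values):
    """Count every value, duplicates included."""
    return len(values)


def distinct_count(values):
    """Count each distinct value once."""
    return len(set(values))


print("value count:   ", value_count(screen_names))
print("distinct count:", distinct_count(screen_names))
```

The two numbers only agree when no user tweets more than once, so a gap
between them does not by itself say anything about the accuracy of an
approximate distinct count.
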
On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:
>
> Hi,
> I'm trying out the new cardinality aggregation, and want to measure the
> accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m
> tweets).
>
> I'm counting the number of unique usernames per language.
> To get my "reference" unique count I use this:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
>     "country_count": {
>       "terms": {
>         "field": "lang"
>       },
>       "aggs": {
>         "unique_count": { "value_count": { "field": "screen_name" } }
>       }
>     }
>   }
> }
>
> Result:
> "aggregations": {
>   "country_count": {
>     "buckets": [
>       {
>         "key": "en",
>         "doc_count": 872906,
>         "unique_count": {
>           "value": 307489
>         }
>       },
>       {
>         "key": "ja",
>         "doc_count": 581521,
>         "unique_count": {
>           "value": 103035
>         }
>       },
>
>
> To get the approximate count with cardinality:
> GET /twitter-2014.03.26/_search
> {
>   "size": 0,
>   "aggs": {
>     "country_count": {
>       "terms": {
>         "field": "lang"
>       },
>       "aggregations": {
>         "distinct_users_approx": {
>           "cardinality": {
>             "field": "screen_name",
>             "precision_threshold": 40000
>           }
>         }
>       }
>     }
>   }
> }
>
> Result:
> "aggregations": {
>   "country_count": {
>     "buckets": [
>       {
>         "key": "en",
>         "doc_count": 872906,
>         "distinct_users_approx": {
>           "value": 145541
>         }
>       },
>       {
>         "key": "ja",
>         "doc_count": 581521,
>         "distinct_users_approx": {
>           "value": 50824
>         }
>       },
>
> So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. Not
> very accurate.
>
> 1) Am I computing the reference distinct count correctly?
> 2) Is it supposed to be this inaccurate on this type of dataset?
> 3) Is there any way to improve precision?
>
> -
> Henrik
>