On Fri, May 23, 2014 at 10:50 AM, <[email protected]> wrote:
> I'm doing some benchmarking on aggregations on a dataset of about
> 50,000,000 documents. It's quite a complex aggregation, nesting 5 levels
> deep. At the top level is a terms aggregation, and the nested levels
> combine terms, stats and percentiles aggregations. With the number of
> shards set to 15 and 20GB of heap on a 3-node cluster in EC2 (m3.2xlarge
> with IOPS-provisioned EBS), the aggregation takes about 90 seconds, with
> quite some memory and CPU (about 50%) consumption. This experiment leads
> me to some questions:
>
> - Will setting routing according to the terms in the top-level
> aggregation have any impact? I don't know how the aggregations work, but
> the assumption is that this will make sure all data for the
> sub-aggregations is on the same node.

What is sure is that it will help with accuracy, since you won't suffer from this limitation of the terms aggregation anymore: https://github.com/elasticsearch/elasticsearch/issues/1305 (the issue is about facets, but aggregations work in a similar way). Regarding performance, it should help, since each shard will have 15x fewer terms to take care of. I said "should" because, depending on the cardinality and distribution of the field of your top terms aggregation, this might leave your shards less evenly balanced, and the shard with the most documents might become a bottleneck.

> - Does the aggregation operate on the entire document, or does it only
> load the fields that are aggregated?

Only the fields that are aggregated.

> - Moving from a local dev environment with one node to a three-node
> cluster didn't yield as much improvement as I expected. Am I hitting any
> "limits" with regard to aggregations of this complexity?
> - Are there any other optimizations that could affect the performance?

Regarding these last two questions, I would recommend upgrading to Elasticsearch 1.2.0 and seeing if it makes things better.
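To make the routing idea concrete: the same routing value is supplied at both index time and search time, so all documents sharing a top-level term land on one shard. The index, type, and field names below are invented for illustration; this is only a sketch against the 1.x REST API, not from the original thread:

```
PUT /metrics/event/1?routing=customer42
{ "customer": "customer42", "latency_ms": 12 }

POST /metrics/event/_search?routing=customer42
{
  "size": 0,
  "aggs": {
    "by_customer": {
      "terms": { "field": "customer" },
      "aggs": {
        "latency_stats": { "stats": { "field": "latency_ms" } }
      }
    }
  }
}
```

Because the search request is routed, only the shard holding that routing value is queried, which is also why the terms counts become exact for that shard's data.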
We have put effort into improving the memory footprint of multi-level aggregations (which might in turn improve performance), as well as the performance of terms aggregations thanks to global ordinals. See http://www.elasticsearch.org/blog/elasticsearch-1-2-0-released/ for more information.

Since you mentioned percentiles, I'd like to note that, although useful, they tend to be costly to compute. You can try playing with the compression setting of this aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html) to see if it helps performance: it will make the aggregation faster and more memory-efficient at the cost of some accuracy. The default value is 100, so you can try e.g. 10 to see if it yields any performance improvement.

--
Adrien Grand
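For reference, the compression value goes directly inside the percentiles aggregation body in the 1.x API. The field name below is invented; a lower compression trades accuracy for speed and memory, as described above:

```
POST /metrics/event/_search
{
  "size": 0,
  "aggs": {
    "latency_pct": {
      "percentiles": {
        "field": "latency_ms",
        "compression": 10
      }
    }
  }
}
```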
