On Fri, May 23, 2014 at 10:50 AM, <[email protected]> wrote:
> I'm doing some benchmarking on aggregations on a dataset of about
> 50,000,000 documents. It's quite a complex aggregation, nesting 5 levels
> deep. At the top level is a terms aggregation, and the nested levels
> combine terms, stats and percentiles aggregations. With the number of
> shards set to 15 and 20GB of heap on a 3-node cluster in EC2 (m3.2xlarge
> with IOPS-provisioned EBS), the aggregation takes about 90 seconds, with
> quite some memory and CPU (about 50%) consumption. This experiment leads
> me to some questions:
>
> - Will setting routing according to the terms in the top-level
> aggregation have any impact? I don't know how the aggregations work, but
> the assumption is that this will make sure all data for the
> sub-aggregations is on the same node.

What is sure is that it will help with accuracy, since you won't suffer from this limitation of the terms aggregation anymore: https://github.com/elasticsearch/elasticsearch/issues/1305 (the issue is about facets, but aggregations work in a similar way). Regarding performance, it should help, since each shard will have 15x fewer terms to take care of. I said "should" because, depending on the cardinality and distribution of the field of your top terms aggregation, this might leave your shards less evenly balanced, and the shard with the most documents might become a bottleneck.

> - Does the aggregation operate on the entire document, or does it only
> load the fields that are aggregated?

Only the fields that are aggregated.

> - Moving from a local dev environment with one node to a three-node
> cluster didn't yield as much improvement as I expected. Am I hitting any
> "limits" with regard to aggregations of this complexity?
> - Are there any other optimizations that could affect the performance?

Regarding these last two questions, I would recommend upgrading to Elasticsearch 1.2.0 and seeing if it makes things better.
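To make the routing idea concrete: the same routing value is supplied at both index time and search time, so all documents sharing a top-level term land on one shard. The index, type, and field names below are invented for illustration; this is only a sketch against the 1.x REST API, not from the original thread:

```
PUT /metrics/event/1?routing=customer42
{ "customer": "customer42", "latency_ms": 12 }

POST /metrics/event/_search?routing=customer42
{
  "size": 0,
  "aggs": {
    "by_customer": {
      "terms": { "field": "customer" },
      "aggs": {
        "latency_stats": { "stats": { "field": "latency_ms" } }
      }
    }
  }
}
```

Because the search request is routed, only the shard holding that routing value is queried, which is also why the terms counts become exact for that shard's data.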
We have put effort into improving the memory footprint of multi-level aggregations (which might in turn improve performance), as well as the performance of terms aggregations thanks to global ordinals. See http://www.elasticsearch.org/blog/elasticsearch-1-2-0-released/ for more information.

Since you mentioned percentiles, I'd like to note that, although useful, they tend to be costly to compute. You can try playing with the compression setting of this aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html) to see if it helps performance: it will make the aggregation faster and more memory-efficient at the cost of some accuracy. The default value is 100, so you can try e.g. 10 to see if it yields any performance improvement.

--
Adrien Grand
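For reference, the compression value goes directly inside the percentiles aggregation body in the 1.x API. The field name below is invented; a lower compression trades accuracy for speed and memory, as described above:

```
POST /metrics/event/_search
{
  "size": 0,
  "aggs": {
    "latency_pct": {
      "percentiles": {
        "field": "latency_ms",
        "compression": 10
      }
    }
  }
}
```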
