I'm doing some benchmarking of aggregations on a dataset of about 50,000,000 documents. It's quite a complex aggregation, nesting 5 levels deep: at the top level is a terms aggregation, and the deeper levels contain a combination of terms, stats and percentiles aggregations. With the number of shards set to 15 and 20GB of heap per node on a 3-node EC2 cluster (m3.2xlarge with IOPS-provisioned EBS), the aggregation takes about 90 seconds, with significant memory consumption and CPU usage around 50%. This experiment leads me to some questions:
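For context, the request has roughly this shape (field and aggregation names here are made up, and I've collapsed it to two of the five levels, but the structure is the same):

```json
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "by_subcategory": {
          "terms": { "field": "subcategory" },
          "aggs": {
            "value_stats": { "stats": { "field": "value" } },
            "value_percentiles": { "percentiles": { "field": "value" } }
          }
        }
      }
    }
  }
}
```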
- Will setting routing according to the terms in the top-level aggregation have any impact? I don't know the internals of how aggregations are executed, but my assumption is that routing would ensure all the data for the sub-aggregations lives on the same node.
- Does the aggregation operate on the entire document, or does it only load the fields being aggregated?
- Moving from a local single-node dev environment to the three-node cluster didn't yield as much improvement as I expected. Am I hitting any "limits" with regard to aggregations of this complexity?
- Are there any other optimizations that could affect the performance?
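To make the routing question concrete, what I have in mind is something like the following (index, type and routing values are placeholders; the search body would be the aggregation above):

```shell
# Index each document with a routing value derived from the top-level term,
# so all documents sharing that term land on the same shard
curl -XPUT 'localhost:9200/myindex/mytype/1?routing=categoryA' -d '{
  "category": "categoryA",
  "subcategory": "sub1",
  "value": 42
}'

# A search restricted by the same routing value then only hits that shard
curl -XPOST 'localhost:9200/myindex/_search?routing=categoryA' -d @agg_query.json
```

My uncertainty is whether this helps at all when the aggregation spans all top-level terms, since such a request would still have to fan out to every shard anyway.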
