Sorry if I did not make it clear. For sure I know aggregation is done on the node for each shard, but here is the challenge. Say we set shard_size=50,000. ES will aggregate on each shard, build buckets for the matching documents, and then send the top 50,000 buckets to the client node for the Reduce phase. Say we have 50 data nodes and each node holds 32 shards. That means each shard sends 50,000 buckets to the client node for the final aggregation. First, this may add heavy traffic to the network (what if we have 100 nodes?). Second, the client node will need to aggregate the received 50 * 32 * 50,000 = 80,000,000 buckets. Would this cause congestion on the client node? However, if we could aggregate on each node first, i.e. reduce its 32 shard-level results down to a single partial result, then the client node would only have to merge 50 results. This would significantly reduce the network traffic and improve scalability, and because we could then afford a relatively larger shard_size, it would also improve the accuracy of the final results, which is another key issue we are facing with aggregations in a distributed environment.
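For concreteness, here is a minimal sketch of the kind of terms aggregation I have in mind, sent with the standard Python `requests` library against a hypothetical `logs` index and `user_id` field (the index, field, and URL are made up for illustration). `size` controls how many buckets come back to the caller, while `shard_size` controls how many buckets each shard ships to the coordinating node for the reduce:

```python
import requests

ES_URL = "http://localhost:9200"   # assumed local node, adjust as needed
INDEX = "logs"                     # hypothetical index name
FIELD = "user_id"                  # hypothetical high-cardinality field

# Terms aggregation: each shard builds its own buckets, keeps its top
# `shard_size` of them, and ships only those to the coordinating node,
# which merges them and returns the top `size`.
query = {
    "size": 0,                     # no hits, aggregation results only
    "aggs": {
        "top_users": {
            "terms": {
                "field": FIELD,
                "size": 1000,          # buckets returned to the caller
                "shard_size": 50000,   # buckets each shard sends for the reduce
            }
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query)
resp.raise_for_status()

# Back-of-the-envelope load on the coordinating node for the scenario above:
# 50 nodes * 32 shards/node * 50,000 buckets/shard = 80,000,000 partial buckets.
nodes, shards_per_node, shard_size = 50, 32, 50_000
print("buckets merged on the client node:", nodes * shards_per_node * shard_size)
```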
So my key question is about scalability, particularly for aggregations. In my experience this is a real challenge, and I just want to hear about other people's experience; for heavy analytics applications this will be key. Of course, I also understand that adding a node-level aggregation step may impact overall performance. I am wondering if anyone has thought about or done anything in this area. BTW, I like ElasticSearch, but want to hear from the community on some of the key challenges.

On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
>
> +1 to what AlexR said. I think there is indeed a bad assumption that
> shards just forward data to the coordinating node; this is not the case.
>
> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <[email protected]> wrote:
>>
>> If you take a terms aggregation, the heavy lifting of the aggregation is
>> done on each node, then the aggregated results are combined on the master node.
>> So if you have thousands of nodes and very high-cardinality nested aggs, the
>> merging part may become a bottleneck, but the cost of doing the actual aggregation
>> is in most cases far higher than the cost of merging results from a reasonable
>> number of shards. So in practice I think it balances pretty well. Of course,
>> you are not limited to one master to handle concurrent requests.
>>
>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
>>>
>>> I thought ES only does "Collect" on individual shards and "Reduce" on the
>>> client node (master if you call it), and nothing is done at the data node level.
>>>
>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>>>>
>>>> ES is already doing aggregations on each node; it is not like it is
>>>> shipping row-level query data back to the master for aggregation.
>>>> In fact, one unpleasant effect of this is that aggregation results are
>>>> not guaranteed to be precise, due to the distributed nature of the
>>>> aggregation, for multi-bucket aggs ordered by count such as terms.
>
> --
> Adrien Grand
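On the accuracy point raised in the quoted thread: a minimal sketch, again assuming the hypothetical `logs` index, `user_id` field, and the `requests` library, of asking a terms aggregation to report how imprecise its distributed counts might be via `show_term_doc_count_error`:

```python
import requests

ES_URL = "http://localhost:9200"   # assumed local node
INDEX = "logs"                     # hypothetical index name

# Ask the terms aggregation to report its worst-case per-term count error,
# which reflects the shard_size / accuracy trade-off discussed above.
query = {
    "size": 0,
    "aggs": {
        "top_users": {
            "terms": {
                "field": "user_id",              # hypothetical field
                "size": 10,
                "shard_size": 1000,
                "show_term_doc_count_error": True,
            }
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query).json()
agg = resp["aggregations"]["top_users"]

# Upper bound on how far off any returned bucket's doc_count could be,
# plus the total count of documents that fell outside the returned buckets.
print("doc_count_error_upper_bound:", agg["doc_count_error_upper_bound"])
print("sum_other_doc_count:", agg["sum_other_doc_count"])
for bucket in agg["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket.get("doc_count_error_upper_bound"))
```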
