Sorry if I did not make it clear. I do know that aggregation is done on 
the node for each shard, but here is the challenge. Say we set 
shard_size=50,000. ES will aggregate on each shard, create buckets for 
the matching documents, and then send the top 50,000 buckets to the client 
node for the Reduce. Say we have 50 data nodes, and each node has 32 shards. 
This means we need to send 50,000 buckets from each shard to the client node 
for final aggregation. First, this may add heavy traffic to the network 
(what if we have 100 nodes?). Second, the client will need to aggregate the 
50*32*50,000 = 80,000,000 buckets it receives. Would this cause congestion 
on the client node?

However, if we could aggregate on the node first, meaning reduce the 32 
per-shard bucket lists to a single list per node, then the client node would 
only have to process 50 bucket lists. This would significantly reduce the 
network traffic and improve scalability. Plus, because we could then set a 
relatively larger shard_size, it would improve the accuracy of the final 
results, which is another key issue we are facing with aggregations in a 
distributed environment.
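
To put rough numbers on the scenario above, here is a back-of-the-envelope 
sketch. It only counts buckets arriving at the client node; it says nothing 
about bytes on the wire or about how ES actually schedules its reduce. The 
node-level reduce branch is hypothetical, and it assumes each node would 
forward one merged list truncated to shard_size again.

```python
# Fan-in estimate for the example in this message. All numbers come from
# the scenario described above; none are measured.
DATA_NODES = 50
SHARDS_PER_NODE = 32
SHARD_SIZE = 50_000  # buckets each shard returns (shard_size)


def buckets_received(nodes, shards_per_node, shard_size, node_level_reduce):
    """Upper bound on buckets arriving at the client node for one request."""
    if node_level_reduce:
        # Hypothetical: each node merges its shards and forwards a single
        # list, truncated to shard_size (an assumption, not ES behavior).
        return nodes * shard_size
    # Each shard sends its own top shard_size buckets.
    return nodes * shards_per_node * shard_size


flat = buckets_received(DATA_NODES, SHARDS_PER_NODE, SHARD_SIZE, False)
two_stage = buckets_received(DATA_NODES, SHARDS_PER_NODE, SHARD_SIZE, True)

print(f"per-shard responses: {flat:,} buckets")
print(f"per-node responses:  {two_stage:,} buckets")
print(f"reduction factor:    {flat // two_stage}x")
```

Under these assumptions the reduction factor is simply the number of shards 
per node, so the benefit grows with shard density rather than with the 
number of nodes.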

So my key question is about scalability, particularly for aggregations. 
In my experience it is a challenge, and I would like to hear about other 
people's experience. For heavy analytics applications, this will be key.

Of course, I also understand that adding node-level aggregation may impact 
overall performance. I am wondering if anyone has thought about or done 
anything in this area.
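
To make the node-level idea concrete, here is a toy model of a terms 
aggregation with an extra merge step per node. It is only a sketch: the 
cluster layout, the function names, and the tiny SHARD_SIZE are made up for 
illustration, and this is not how ES implements its reduce phase.

```python
from collections import Counter


def shard_top_buckets(term_counts, shard_size):
    """Collect phase on one shard: keep only the top shard_size terms."""
    return Counter(dict(Counter(term_counts).most_common(shard_size)))


def merge(bucket_lists):
    """Reduce phase: sum the doc counts of identical terms."""
    total = Counter()
    for buckets in bucket_lists:
        total.update(buckets)  # Counter.update adds counts
    return total


# Toy data (hypothetical): 2 nodes x 2 shards, term -> doc count per shard.
cluster = {
    "node-1": [{"a": 5, "b": 3, "c": 1}, {"a": 2, "c": 4}],
    "node-2": [{"b": 6, "d": 2}, {"a": 1, "d": 3, "c": 2}],
}
SHARD_SIZE = 2

per_shard = {
    node: [shard_top_buckets(shard, SHARD_SIZE) for shard in shards]
    for node, shards in cluster.items()
}

# Flat reduce: the client merges every per-shard list (4 lists here).
flat = merge(b for lists in per_shard.values() for b in lists)

# Two-stage reduce: each node merges its own shards first, then the
# client merges one list per node (2 lists here).
per_node = [merge(lists) for lists in per_shard.values()]
two_stage = merge(per_node)

# Exact counts, for comparison with the truncated results.
exact = merge(Counter(s) for shards in cluster.values() for s in shards)

print("flat      :", dict(flat))
print("two-stage :", dict(two_stage))
print("exact     :", dict(exact))
```

As long as the node-level merge does not truncate, both paths give the same 
counts, because summing is associative; the client just receives fewer 
lists. The accuracy loss relative to the exact counts comes from the 
per-shard truncation, which is why a larger shard_size helps.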

BTW, I like ElasticSearch, but want to hear from the community on some of 
the key challenges.



On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
>
> +1 to what AlexR said. I think there is indeed a bad assumption that 
> shards just forward data to the coordinating node, this is not the case.
>
> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <[email protected]> 
> wrote:
>>
>> if you take a terms aggregation, the heavy lifting of the aggregation is 
>> done on each node then aggregated results are combined on the master node. 
>> So if you have thousands of nodes and very high cardinality nested aggs the 
>> merging part may become a bottleneck but cost of doing actual aggregation 
>> in most cases is far higher than cost of merging results from reasonable 
>> number of shards. So in practice I think it balances pretty well. Of course 
>> you are not limited to one master to handle concurrent requests
>>
>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
>>>
>>> I thought ES only "Collect" on individual shards, and "Reduce" on Client 
>>> Node (master if you call it), nothing is done at the data node level.
>>>
>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>>>>
>>>> ES already doing aggregations on each node. it is not like it is 
>>>> shipping row level query data back to master for aggregation. 
>>>> In fact, one unpleasant effect of it is that aggregation results are 
>>>> not guaranteed to be precise due to distributed nature of the aggregation 
>>>> for multibucket aggs ordered by count such as terms
>>>
>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> -- 
> Adrien Grand
>  

