Correction "meaning reduce from 32 buckets to only one bucket, then the 
client node only has to process 50 buckets." should read "reduce from 32 of 
50K buckets to only 1 being sent to the client node, then the client node 
only has to process 50 of 50K buckets".  

On Thursday, December 18, 2014 10:48:07 AM UTC-5, Yifan Wang wrote:
>
> Sorry, if I did not make it clear. For sure I know aggregation is done on 
> the node for each shard, but here is the challenge. Say we set 
> shard_size=50,000. ES will aggregate on each shard and create buckets for 
> the matching documents, and then send top 50,000 buckets to the client node 
> for Reduce. Say we have 50 data nodes, and each node has 32 shards. This 
> means we need to send 50,000 buckets from each shard to the client node for 
> final aggregation. First, this may add heavy traffic to the network (what 
> if we have 100 nodes?). And second, the client will need to aggregate on 
> received 50*32*50,000 buckets. Would this cause any congestion on the 
> client node? However if we can aggregate on the node first, meaning reduce 
> from 32 buckets to only one bucket, then the client node only has to 
> process 50 buckets. This would significantly reduce the network traffic and 
> improve the scalability, plus because we can set relatively larger 
> shard_size, it will improve the accuracy of the final results, which is 
> another key issue we are facing in distributed environment on aggregations.
>
> So my key question is about the scalability particularly on aggregations. 
> It seems to be a challenge in my experience. I just want to hear other 
> people's experience. On heavy analytics applications, this will be a key.
>
> Of course, I also understand, adding node level aggregation may impact the 
> overall performance. I am wondering if anyone has thought about or done 
> anything in this aspect.
>
> BTW, I like ElasticSearch, but want to hear from the community on some of 
> the key challenges.
>
>
>
> On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
>>
>> +1 to what AlexR said. I think there is indeed a bad assumption that 
>> shards just forward data to the coordinating node, this is not the case.
>>
>> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <[email protected]> wrote:
>>>
>>> if you take a terms aggregation, the heavy lifting of the aggregation is 
>>> done on each node then aggregated results are combined on the master node. 
>>> So if you have thousands of nodes and very high cardinality nested aggs the 
>>> merging part may become a bottleneck but cost of doing actual aggregation 
>>> in most cases is far higher than cost of merging results from reasonable 
>>> number of shards. So in practice I think it balances pretty well. Of course 
>>> you are not limited to one master to handle concurrent requests
>>>
>>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
>>>>
>>>> I thought ES only "Collect" on individual shards, and "Reduce" on 
>>>> Client Node (master if you call it), nothing is done at the data node 
>>>> level.
>>>>
>>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>>>>>
>>>>> ES already doing aggregations on each node. it is not like it is 
>>>>> shipping row level query data back to master for aggregation. 
>>>>> In fact, one unpleasant effect of it is that aggregation results are 
>>>>> not guaranteed to be precise due to distributed nature of the aggregation 
>>>>> for multibucket aggs ordered by count such as terms
>>>>
>>>>  -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> -- 
>> Adrien Grand
>>  
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7ae50080-9e09-48df-bd50-552e74773c1d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to