Re: Is ElasticSearch truly scalable for analytics?

Mark Harwood Wed, 14 Jan 2015 10:37:37 -0800

>
> Understood, but what about cases where size is set to unlimited?  
> Inaccuracies are not a concern in that case, correct?
>


Correct. But if we only consider the scenarios where the key sets are 
complete and accuracy is not put at risk by merging (i.e. there is no "top 
N" type filtering in play), how many of these use cases generate result 
trees large enough for node-level merging to be beneficial? 
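
To make the accuracy point concrete, here is a rough sketch in plain Python. 
It is not Elasticsearch code: the shard contents, the two-node layout and the 
'node_size' cut-off are all made up for illustration ('node_size' is the 
hypothetical setting mentioned further down the thread, not an existing 
option). It shows that a node-level merge gives exact results when every 
stage keeps the complete key set, and can lose the true top term once each 
stage trims to a top N.

```python
from collections import Counter

def top_n(counts, n):
    """Keep only the n largest buckets; n=None keeps the complete key set."""
    return Counter(dict(counts.most_common(n)))

def merge(partials):
    """Sum per-term counts across partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Made-up per-shard term counts. Assumption for this sketch:
# shards 0 and 1 sit on node A, shard 2 sits on node B.
shards = [
    Counter({"a": 10, "b": 9, "c": 8}),
    Counter({"a": 10, "c": 9, "b": 1}),
    Counter({"b": 10, "c": 9, "a": 1}),
]

def two_level(shards, shard_size, node_size):
    """Trim per shard, merge and trim per node, then merge on the client."""
    node_a = top_n(merge(top_n(s, shard_size) for s in shards[:2]), node_size)
    node_b = top_n(merge(top_n(s, shard_size) for s in shards[2:]), node_size)
    return merge([node_a, node_b])

exact = merge(shards)                       # single-pass merge of everything
complete = two_level(shards, None, None)    # no trimming at any stage
truncated = two_level(shards, 2, 1)         # trimming at shard and node level

print("exact    ", dict(exact))
print("complete ", dict(complete))
print("truncated", dict(truncated))
```

With complete key sets the extra merge stage changes nothing. With trimming, 
term "c" is never the largest bucket at any single stage, so it drops out 
even though it has the highest total overall.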
 

>
> On Wednesday, January 14, 2015 at 1:09:48 PM UTC-5, Mark Harwood wrote:
>>
>> If you introduce an extra reduction phase (for multiple shards on the 
>> same node) you introduce further potential for inaccuracies in the final 
>> results.
>> Consider the role of 'size' and 'shard_size' in the "terms" aggregation 
>> [1] and the effects they have on accuracy. You'd arguably need a 
>> 'node_size' setting to also control the size of this new intermediate 
>> collection. All stages that reduce the volumes of data processed can 
>> introduce an approximation with the potential for inaccuracies upstream 
>> when merging.
>>
>>
>> [1] 
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_shard_size
>>
>> On Wednesday, January 14, 2015 at 5:44:47 PM UTC, Elliott Bradshaw wrote:
>>>
>>> Adrien,
>>>
>>> I get the feeling that you're a pretty heavy contributor to the 
>>> aggregation module.  In your experience, would a shard per cpu core 
>>> strategy be an effective performance solution in a pure aggregation use 
>>> case?    If this could proportionally reduce the aggregation time, would a 
>>> node local reduce (in which all shard aggregations on a given node are 
>>> reduced prior to being sent to the client node) be a good follow on 
>>> strategy for further enhancement?
>>>
>>> On Wednesday, January 14, 2015 at 10:56:03 AM UTC-5, Adrien Grand wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Jan 14, 2015 at 4:16 PM, Elliott Bradshaw <[email protected]> 
>>>> wrote:
>>>>
>>>>> Just out of curiosity, are aggregations on multiple shards on a single 
>>>>> node executed serially or in parallel?  In my experience, it appears that 
>>>>> they're executed serially (my CPU usage did not change when going from 1 
>>>>> shard to 2 shards per node, but I didn't test this extensively).  I'm 
>>>>> interested in maximizing the parallelism of an aggregation without 
>>>>> creating a massive number of nodes.
>>>>>
>>>>>
>>>> Requests are processed serially per shard, but several shards can be 
>>>> processed at the same time. So if you have an index that consists of N 
>>>> primaries, this would run on N processors of your cluster in parallel.
>>>>
>>>>
>>>> -- 
>>>> Adrien Grand
>>>>  
>>>

