The infoStream logs are linked below (node5 is the log of the normal node, 
node6 is the log of the problematic node).

I don't think it was doing big merges; otherwise the merges graph line 
would have had a "floor" > 0 during the high-load period, similar to the 
period after I disabled refresh. We don't do custom routing and use mostly 
default settings. I think the only settings we changed are:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
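
For concreteness, a minimal sketch (Python; it assumes the 1.x update-settings 
API, that the translog setting is dynamically updatable, and a placeholder 
index name) of applying the index-level setting at runtime. The 
indices.memory.index_buffer_size setting is node-level, so it lives in 
elasticsearch.yml and needs a restart:

    # Sketch only: apply the dynamic index-level translog setting via the
    # update-settings API. "myindex" and the endpoint are placeholders.
    import requests

    resp = requests.put(
        "http://localhost:9200/myindex/_settings",
        json={"index": {"translog.flush_threshold_ops": 50000}},
    )
    resp.raise_for_status()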

We are running on 6 CPU/12 core machines with a 32GB heap, 96GB of memory, 
and 4 spinning disks.

node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
node 6 log (high load) <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>

On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>
> Can you post the IndexWriter infoStream output?  I can see if anything 
> stands out.
>
> Maybe it was just that this node was doing big merges?  I.e., if you 
> waited long enough, the other shards would eventually do their big merges 
> too?
>
> Have you changed any default settings, do custom routing, etc.?  Is there 
> any reason to think that the docs that land on this node are "different" in 
> any way?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>
>>  From all the information I’ve collected, it seems to be the merging 
>> activity:
>>
>>
>>    1. We capture the cluster stats into graphite and the current merges 
>>    stat seems to be about 10x higher on this node. 
>>    2. The problem has occurred on different physical machines, so a h/w 
>>    issue seems unlikely. Once it starts it doesn't seem to stop. We have 
>>    blown away the indices in the past and started indexing again after 
>>    enabling more logging/stats. 
>>    3. I've stopped executing queries so the only thing that's happening 
>>    on the cluster is indexing.
>>    4. Last night while the problem was ongoing, I disabled refresh 
>>    (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>    returned to normal. The merge activity seemed to drop off; it looks 
>>    like 2 very long-running merges are executing, but not much else. 
>>    5. I grepped an hour of logs from the 2 machines for "add merge=": 
>>    there were 540 matches on the high-load node and 420 on a normal node. 
>>    I pulled the size values out of the log messages and the merges seemed 
>>    to be much smaller on the high-load node (a rough sketch of that 
>>    extraction is below). 
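>>
>> A rough sketch of that extraction (Python; the exact "add merge=" line 
>> format in the infoStream output is an assumption, so the size pattern may 
>> need adjusting):
>>
>>     import re, sys
>>
>>     # Count "add merge=" events in an IndexWriter infoStream log and pull
>>     # out the reported sizes. The "size=<float> MB" pattern is assumed.
>>     size_re = re.compile(r"size=([\d.]+) MB")
>>
>>     count, sizes = 0, []
>>     with open(sys.argv[1]) as f:
>>         for line in f:
>>             if "add merge=" not in line:
>>                 continue
>>             count += 1
>>             m = size_re.search(line)
>>             if m:
>>                 sizes.append(float(m.group(1)))
>>
>>     print("add merge events:", count)
>>     if sizes:
>>         print("approx median size (MB):", sorted(sizes)[len(sizes) // 2])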
>>
>> I just created the indices a few days ago, so the shards of each index 
>> are balanced across the nodes. We have external metrics around document 
>> ingest rate and there was no spike during this time period. 
>>
>>
>>
>> Thanks
>> Kireet
>>
>>
>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>
>>> It's perfectly normal/healthy for many small merges below the floor size 
>>> to happen.
>>>
>>> I think you should first figure out why this node is different from the 
>>> others. Are you sure it's merging CPU cost that's different?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:
>>>
>>>>  We have a situation where one of the four nodes in our cluster seems 
>>>> to get caught up endlessly merging. However, it appears to be CPU-bound 
>>>> rather than I/O-constrained. I have enabled the IndexWriter infoStream 
>>>> logs, and it often seems to merge quite small segments (100KB) that are 
>>>> well below the floor size (2MB). I suspect this is due to frequent 
>>>> refreshes and/or using lots of threads concurrently to do indexing. Is 
>>>> this true?
>>>>
>>>> My supposition is that this leads the merge policy to merge lots of 
>>>> very small segments into another still-small segment, which will again 
>>>> require a merge just to reach the floor size. My index has 64 segments 
>>>> and 25 are below the floor size. I am wondering if there should be an 
>>>> exception to the maxMergesAtOnce parameter for the first level, so that 
>>>> many small segments could be merged at once in this case.
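>>>>
>>>> As a back-of-the-envelope illustration (assuming ~100KB flushed 
>>>> segments, the 2MB floor, and the default maxMergesAtOnce of 10), the 
>>>> first merge round still produces a below-floor segment, so a second 
>>>> round is needed:
>>>>
>>>>     # Rough arithmetic only: merge rounds needed before a segment built
>>>>     # purely from ~100KB flushes exceeds the 2MB floor, merging at most
>>>>     # 10 equal-sized segments at a time.
>>>>     floor_mb = 2.0
>>>>     max_merges_at_once = 10
>>>>     size_mb = 0.1  # ~100KB starting segments
>>>>
>>>>     rounds = 0
>>>>     while size_mb < floor_mb:
>>>>         size_mb *= max_merges_at_once  # merge 10 equal-sized segments
>>>>         rounds += 1
>>>>
>>>>     print(rounds, "rounds, ~%.0f MB" % size_mb)
>>>>     # -> 2 rounds: 10 x 100KB is ~1MB (still below floor), then 10 x 1MB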
>>>>
>>>> I am considering changing the other parameters (wider tiers, lower 
>>>> floor size, more concurrent merges allowed), but these all seem to have 
>>>> side effects I may not necessarily want. Is there a good solution here?
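>>>>
>>>> For concreteness, a sketch of the kind of adjustment I mean (Python; 
>>>> the index.merge.policy.* / index.merge.scheduler.* names are my 
>>>> recollection of the 1.x TieredMergePolicy bindings, and some may only 
>>>> be settable at index creation or in elasticsearch.yml, so please 
>>>> double-check):
>>>>
>>>>     # Sketch only: wider tiers, lower floor, more concurrent merges.
>>>>     # Setting names and dynamic updatability are assumptions for 1.x.
>>>>     import requests
>>>>
>>>>     requests.put(
>>>>         "http://localhost:9200/myindex/_settings",
>>>>         json={
>>>>             "index.merge.policy.segments_per_tier": 20,
>>>>             "index.merge.policy.max_merge_at_once": 20,
>>>>             "index.merge.policy.floor_segment": "1mb",
>>>>             "index.merge.scheduler.max_thread_count": 3,
>>>>         },
>>>>     ).raise_for_status()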
>>>>  
>>>
>
>
