Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 
02:10. I included a couple hours before and after in the logs.

On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>
> They are linked below (node5 is the log of the normal node, node6 is the 
> log of the problematic node). 
>
> I don't think it was doing big merges; otherwise, during the high load 
> period the merges graph line would have had a "floor" > 0, similar to the 
> time period after I disabled refresh. We don't do custom routing and use 
> mostly default settings. I think the only settings we changed are:
>
> indices.memory.index_buffer_size: 50%
> index.translog.flush_threshold_ops: 50000
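>
> (To double-check what is actually in effect, something like the following 
> should work; the host/port and index name are placeholders, not our real 
> ones:) 
>
> # node-level settings such as indices.memory.index_buffer_size appear in 
> # the "settings" section of the nodes info response 
> curl 'localhost:9200/_nodes?pretty' 
>
> # per-index settings, including index.translog.flush_threshold_ops 
> curl 'localhost:9200/myindex/_settings?pretty'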
>
> We are running on a 6 CPU / 12 core machine with a 32GB heap and 96GB of 
> memory, with 4 spinning disks. 
>
> node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
> node 6 log (high load) 
> <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>
>
> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>
>> Can you post the IndexWriter infoStream output?  I can see if anything 
>> stands out.
>>
>> Maybe it was just that this node was doing big merges?  I.e., if you 
>> waited long enough, the other shards would eventually do their big merges 
>> too?
>>
>> Have you changed any default settings, do you use custom routing, etc.?  Is 
>> there any reason to think that the docs that land on this node are 
>> "different" in any way?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>>
>>>  From all the information I’ve collected, it seems to be the merging 
>>> activity:
>>>
>>>
>>>    1. We capture the cluster stats into graphite, and the current merges 
>>>    stat seems to be about 10x higher on this node. 
>>>    2. The problem has occurred on different physical machines, so a h/w 
>>>    issue seems unlikely. Once the problem starts, it doesn't seem to stop. 
>>>    We have blown away the indices in the past and started indexing again 
>>>    after enabling more logging/stats. 
>>>    3. I've stopped executing queries, so the only thing happening on the 
>>>    cluster is indexing. 
>>>    4. Last night, while the problem was ongoing, I disabled refresh 
>>>    (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>>    returned to normal. The merge activity seemed to decrease; it looks 
>>>    like 2 very long-running merges are executing, but not much else. 
>>>    5. I grepped an hour of logs on the 2 machines for "add merge="; the 
>>>    count was 540 on the high load node and 420 on a normal node. I pulled 
>>>    out the size values from the log messages, and the merges seemed to be 
>>>    much smaller on the high load node. (The rough commands for the refresh 
>>>    change and the grep are sketched after this list.)
>>>
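>>> For reference, roughly the commands involved (the index name and log file 
>>> below are placeholders, not our real ones): 
>>>
>>> # disable refresh on the index, as done around 2:10am (item 4) 
>>> curl -XPUT 'localhost:9200/myindex/_settings' -d '{"index.refresh_interval": "-1"}' 
>>>
>>> # count merge starts in an hour's slice of the IndexWriter infoStream log (item 5) 
>>> grep -c 'add merge=' node6-infostream.log 
>>>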
>>> I just created the indices a few days ago, so the shards of each index 
>>> are balanced across the nodes. We have external metrics around document 
>>> ingest rate and there was no spike during this time period. 
>>>
>>>
>>>
>>> Thanks
>>> Kireet
>>>
>>>
>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>
>>>> It's perfectly normal/healthy for many small merges below the floor 
>>>> size to happen.
>>>>
>>>> I think you should first figure out why this node is different from the 
>>>> others.  Are you sure it's merging CPU cost that's different?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:
>>>>
>>>>>  We have a situation where one of the four nodes in our cluster seems 
>>>>> to get caught up endlessly merging. However, it seems to be high CPU 
>>>>> activity rather than being I/O constrained. I have enabled the 
>>>>> IndexWriter infoStream logs, and often it seems to do merges of quite 
>>>>> small segments (100KB) that are much below the floor size (2MB). I 
>>>>> suspect this is due to frequent refreshes and/or using lots of threads 
>>>>> concurrently to do indexing. Is this true?
>>>>>
>>>>> My supposition is that this is leading to the merge policy doing lots 
>>>>> of merges of very small segments into another small segment, which will 
>>>>> again require a merge to even reach the floor size. My index has 64 
>>>>> segments and 25 are below the floor size. I am wondering if there should 
>>>>> be an exception for the maxMergesAtOnce parameter for the first level, 
>>>>> so that many small segments could be merged at once in this case.
>>>>>
>>>>> I am considering changing the other parameters (wider tiers, lower 
>>>>> floor size, more concurrent merges allowed), but these all seem to have 
>>>>> side effects I may not necessarily want. Is there a good solution here?
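>>>>>
>>>>> If I do end up tuning those, I assume the knobs are the TieredMergePolicy 
>>>>> settings exposed under index.merge.policy; a rough sketch is below (the 
>>>>> index name and the values are purely illustrative, and I would need to 
>>>>> verify whether they can be changed on a live index or have to be set at 
>>>>> index creation): 
>>>>>
>>>>> curl -XPUT 'localhost:9200/myindex/_settings' -d '{ 
>>>>>   "index.merge.policy.floor_segment": "1mb", 
>>>>>   "index.merge.policy.segments_per_tier": 20, 
>>>>>   "index.merge.policy.max_merge_at_once": 20 
>>>>> }'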
>>>>>  
>>>>>
>>>>
>>
>>
