Indeed there are no big merges during that time ...

I can see that on node5, at ~14:45, merges are suddenly taking a long
time: refresh is taking much longer (4-5 seconds instead of < 0.4 sec),
commit time goes up from < 0.5 sec to ~1-2 sec, etc.  But other metrics
look fine, e.g. total merging GB and the number of commits/refreshes are
very low during this time.
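
If you want to keep an eye on those numbers directly, here's a rough
sketch that samples the per-node refresh/flush/merge totals from the nodes
stats API and prints the average time per operation over an interval
(the host/port and the 60-second interval are assumptions; adjust to your
setup):

# Sample per-node refresh/flush/merge totals twice and print the average
# time per operation over the interval.
import json
import time
import urllib.request

URL = "http://localhost:9200/_nodes/stats/indices"  # assumed local node

def sample():
    with urllib.request.urlopen(URL) as resp:
        return json.load(resp)["nodes"]

before = sample()
time.sleep(60)  # sampling interval (assumption)
after = sample()

for node_id, stats in after.items():
    prev = before.get(node_id)
    if prev is None:
        continue
    for op in ("refresh", "flush", "merges"):
        cur, old = stats["indices"][op], prev["indices"][op]
        count = cur["total"] - old["total"]
        millis = cur["total_time_in_millis"] - old["total_time_in_millis"]
        avg = millis / count / 1000.0 if count else 0.0
        print("%s %s: %d ops, avg %.2f sec" % (stats["name"], op, count, avg))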

Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards.  The
biggish shards are indexing at a very slow rate and only have ~1%
deletions.  Are you explicitly deleting docs?
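
If you're not sure, a quick way to check is to pull docs.count and
docs.deleted per index from the index stats API; a rough sketch, assuming
a node reachable on localhost:9200:

# Print the deleted-doc ratio per index from the index stats API.
import json
import urllib.request

URL = "http://localhost:9200/_stats/docs"  # assumed local node

with urllib.request.urlopen(URL) as resp:
    indices = json.load(resp)["indices"]

for name, s in sorted(indices.items()):
    docs = s["primaries"]["docs"]
    count, deleted = docs["count"], docs["deleted"]
    total = count + deleted
    pct = 100.0 * deleted / total if total else 0.0
    print("%-30s docs=%d deleted=%d (%.1f%%)" % (name, count, deleted, pct))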

I suspect something is suddenly cutting into the IO performance of this
box, and because merging/refreshing is so IO intensive, it causes these
operations to run slower and back up.

Are there any scheduled jobs, e.g. backups/snapshots, that start up?  Are
you running on virtualized hardware?
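
One way to catch a culprit in the act is to leave a small sampler running
that logs disk utilization over time, so you can line it up with the 14:45
spike; a minimal sketch reading /proc/diskstats (Linux only; the device
names and interval are assumptions, and iostat -x would show roughly the
same thing):

# Log approximate %util per disk every INTERVAL seconds, based on the
# "milliseconds spent doing I/O" counter in /proc/diskstats.
import time

DEVICES = ("sda", "sdb", "sdc", "sdd")  # assumed: the 4 spinning disks
INTERVAL = 10.0  # seconds

def read_io_ms():
    ms = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] in DEVICES:
                ms[fields[2]] = int(fields[12])  # ms spent doing I/O
    return ms

prev = read_io_ms()
while True:
    time.sleep(INTERVAL)
    cur = read_io_ms()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    util = " ".join(
        "%s=%.0f%%" % (dev, 100.0 * (cur[dev] - prev[dev]) / (INTERVAL * 1000.0))
        for dev in DEVICES if dev in cur and dev in prev
    )
    print(stamp, util)
    prev = cur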


Mike McCandless

http://blog.mikemccandless.com


On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy <[email protected]> wrote:

> Just to reiterate, the problematic period is from 07/05 14:45 to 07/06
> 02:10. I included a couple hours before and after in the logs.
>
>
> On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>>
>> They are linked below (node5 is the log of the normal node, node6 is the
>> log of the problematic node).
>>
>> I don't think it was doing big merges; otherwise, during the high load
>> period, the merges graph line would have had a "floor" > 0, similar to the
>> time period after I disabled refresh. We don't use custom routing and use
>> mostly default settings. I think the only settings we changed are:
>>
>> indices.memory.index_buffer_size: 50%
>> index.translog.flush_threshold_ops: 50000
>>
>> We are running on a 6-CPU/12-core machine with a 32 GB heap and 96 GB of
>> memory, with 4 spinning disks.
>>
>> node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
>> node 6 log (high load)
>> <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>
>>
>> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>>
>>> Can you post the IndexWriter infoStream output?  I can see if anything
>>> stands out.
>>>
>>> Maybe it was just that this node was doing big merges?  I.e., if you
>>> waited long enough, the other shards would eventually do their big merges
>>> too?
>>>
>>> Have you changed any default settings, do custom routing, etc.?  Is
>>> there any reason to think that the docs that land on this node are
>>> "different" in any way?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>>>
>>>>  From all the information I’ve collected, it seems to be the merging
>>>> activity:
>>>>
>>>>
>>>>    1. We capture the cluster stats into graphite and the current
>>>>    merges stat seems to be about 10x higher on this node.
>>>>    2. The problem has occurred on different physical machines, so a h/w
>>>>    issue seems unlikely. Once the problem starts, it doesn't seem to
>>>>    stop. We have blown away the indices in the past and started indexing
>>>>    again after enabling more logging/stats.
>>>>    3. I've stopped executing queries so the only thing that's
>>>>    happening on the cluster is indexing.
>>>>    4. Last night, while the problem was ongoing, I disabled refresh
>>>>    (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load
>>>>    returned to normal. The merge activity seemed to decrease; it looks
>>>>    like 2 very long-running merges are executing, but not much else.
>>>>    5. I grepped an hour of logs from the 2 machines for "add merge=";
>>>>    the count was 540 on the high load node and 420 on a normal node. I
>>>>    pulled out the size value from the log messages and the merges seemed
>>>>    to be much smaller on the high load node (see the sketch just below
>>>>    this list).
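>>>>
>>>> A rough sketch of that grep/size extraction (the size= pattern is a
>>>> guess at the exact infoStream line format; adjust it to what the logs
>>>> actually contain):
>>>>
>>>> # Count "add merge=" lines in an infoStream log and summarize the
>>>> # reported merge sizes.
>>>> import re
>>>> import sys
>>>>
>>>> SIZE_RE = re.compile(r"size=([\d.]+)\s*(KB|MB|GB)", re.IGNORECASE)
>>>> UNIT = {"KB": 1.0 / 1024, "MB": 1.0, "GB": 1024.0}
>>>>
>>>> count = 0
>>>> sizes_mb = []
>>>> for line in open(sys.argv[1]):
>>>>     if "add merge=" not in line:
>>>>         continue
>>>>     count += 1
>>>>     m = SIZE_RE.search(line)
>>>>     if m:
>>>>         sizes_mb.append(float(m.group(1)) * UNIT[m.group(2).upper()])
>>>>
>>>> sizes_mb.sort()
>>>> print("add merge lines:", count)
>>>> if sizes_mb:
>>>>     print("median size: %.2f MB" % sizes_mb[len(sizes_mb) // 2])
>>>>     print("mean size:   %.2f MB" % (sum(sizes_mb) / len(sizes_mb)))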
>>>>
>>>> I just created the indices a few days ago, so the shards of each index
>>>> are balanced across the nodes. We have external metrics around document
>>>> ingest rate and there was no spike during this time period.
>>>>
>>>>
>>>>
>>>> Thanks
>>>> Kireet
>>>>
>>>>
>>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>>
>>>>> It's perfectly normal/healthy for many small merges below the floor
>>>>> size to happen.
>>>>>
>>>>> I think you should first figure out why this node is different from
>>>>> the others.  Are you sure it's the merging CPU cost that's different?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>  We have a situation where one of the four nodes in our cluster
>>>>>> seems to get caught up endlessly merging.  However, it seems to be
>>>>>> constrained by high CPU activity rather than I/O. I have enabled the
>>>>>> IndexWriter infoStream logs, and often it seems to do merges of quite
>>>>>> small segments (100KB) that are much below the floor size (2MB). I
>>>>>> suspect this is due to frequent refreshes and/or using lots of threads
>>>>>> concurrently to do indexing. Is this true?
>>>>>>
>>>>>> My supposition is that this is leading the merge policy to do lots of
>>>>>> merges of very small segments into another small segment, which will
>>>>>> again require a merge to even reach the floor size. My index has 64
>>>>>> segments and 25 are below the floor size. I am wondering if there should
>>>>>> be an exception to the maxMergeAtOnce parameter for the first level so
>>>>>> that many small segments could be merged at once in this case.
>>>>>>
>>>>>> I am considering changing the other parameters (wider tiers, lower
>>>>>> floor size, more concurrent merges allowed) but these all seem to have 
>>>>>> side
>>>>>> effects I may not necessarily want. Is there a good solution here?
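>>>>>>
>>>>>> For concreteness, these are the kinds of knobs I mean, via the
>>>>>> update-settings API (a sketch only; the values are arbitrary, and
>>>>>> whether each setting can be changed dynamically depends on the ES
>>>>>> version):
>>>>>>
>>>>>> # Adjust the tiered merge policy settings on one index (illustrative
>>>>>> # values only).
>>>>>> import json
>>>>>> import urllib.request
>>>>>>
>>>>>> INDEX = "my_index"  # hypothetical index name
>>>>>> settings = {
>>>>>>     "index.merge.policy.segments_per_tier": 20,   # wider tiers
>>>>>>     "index.merge.policy.floor_segment": "1mb",    # lower floor size
>>>>>>     "index.merge.policy.max_merge_at_once": 20,   # bigger first-level merges
>>>>>> }
>>>>>>
>>>>>> req = urllib.request.Request(
>>>>>>     "http://localhost:9200/%s/_settings" % INDEX,
>>>>>>     data=json.dumps(settings).encode("utf-8"),
>>>>>>     headers={"Content-Type": "application/json"},
>>>>>>     method="PUT",
>>>>>> )
>>>>>> print(urllib.request.urlopen(req).read().decode("utf-8"))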
>>>>>>
