Indeed, there are no big merges during that time ... I can see on node5 that at ~14:45 merges suddenly start taking a long time, refresh takes much longer (4-5 seconds instead of < 0.4 sec), commit time goes up from < 0.5 sec to ~1-2 sec, etc. But other metrics look fine, e.g. total merging GB; the number of commits/refreshes is very low during this time.
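
FWIW, a quick way to watch those per-node timings is the nodes stats API. Here is a rough, untested Python sketch (the host/port, the 60-second interval, and the printed summary are just placeholders I made up, and the exact JSON layout can vary a bit across ES versions):

import json
import time
import urllib.request

STATS_URL = "http://localhost:9200/_nodes/stats/indices"   # placeholder host/port
INTERVAL = 60  # seconds between samples

def sample():
    # Pull per-node indices stats (cumulative counters since node start).
    with urllib.request.urlopen(STATS_URL) as resp:
        nodes = json.loads(resp.read().decode("utf-8"))["nodes"]
    out = {}
    for node in nodes.values():
        idx = node["indices"]
        out[node["name"]] = {
            "refresh_ops": idx["refresh"]["total"],
            "refresh_ms": idx["refresh"]["total_time_in_millis"],
            "merge_ms": idx["merges"]["total_time_in_millis"],
            "flush_ms": idx["flush"]["total_time_in_millis"],
        }
    return out

prev = sample()
while True:
    time.sleep(INTERVAL)
    cur = sample()
    for name, s in cur.items():
        p = prev.get(name, s)
        refreshes = s["refresh_ops"] - p["refresh_ops"]
        avg_refresh = (s["refresh_ms"] - p["refresh_ms"]) / refreshes if refreshes else 0.0
        print("%s: +%d ms merging, +%d ms flushing, avg refresh %.0f ms (%d refreshes)"
              % (name, s["merge_ms"] - p["merge_ms"],
                 s["flush_ms"] - p["flush_ms"], avg_refresh, refreshes))
    prev = cur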

Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards. The
biggish shards are indexing at a very slow rate and only have ~1%
deletions. Are you explicitly deleting docs?

I suspect something is suddenly cutting into the IO perf of this box, and
because merging/refreshing is so IO intensive, it causes these operations
to run slower / backlog. Are there any scheduled jobs, e.g.
backups/snapshots, that start up? Are you running on virtualized hardware?

Mike McCandless

http://blog.mikemccandless.com


On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy <[email protected]> wrote:

> Just to reiterate, the problematic period is from 07/05 14:45 to 07/06
> 02:10. I included a couple of hours before and after in the logs.
>
> On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>>
>> They are linked below (node5 is the log of the normal node, node6 is the
>> log of the problematic node).
>>
>> I don't think it was doing big merges; otherwise, during the high-load
>> period the merges graph line would have had a "floor" > 0, similar to
>> the time period after I disabled refresh. We don't do routing and use
>> mostly default settings. I think the only settings we changed are:
>>
>> indices.memory.index_buffer_size: 50%
>> index.translog.flush_threshold_ops: 50000
>>
>> We are running on a 6 CPU / 12 core machine with a 32GB heap and 96GB of
>> memory with 4 spinning disks.
>>
>> node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
>> node 6 log (high load) <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>
>>
>> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>>
>>> Can you post the IndexWriter infoStream output? I can see if anything
>>> stands out.
>>>
>>> Maybe it was just that this node was doing big merges? I.e., if you
>>> waited long enough, the other shards would eventually do their big
>>> merges too?
>>>
>>> Have you changed any default settings, do you use custom routing, etc.?
>>> Is there any reason to think that the docs that land on this node are
>>> "different" in any way?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>>>
>>>> From all the information I've collected, it seems to be the merging
>>>> activity:
>>>>
>>>> 1. We capture the cluster stats into graphite, and the current merges
>>>> stat seems to be about 10x higher on this node.
>>>> 2. The problem has occurred on different physical machines, so a h/w
>>>> issue seems unlikely. Once the problem starts it doesn't seem to stop.
>>>> We have blown away the indices in the past and started indexing again
>>>> after enabling more logging/stats.
>>>> 3. I've stopped executing queries, so the only thing happening on the
>>>> cluster is indexing.
>>>> 4. Last night when the problem was ongoing, I disabled refresh
>>>> (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load
>>>> returned to normal. The merge activity seemed to reduce; it looks like
>>>> 2 very long-running merges are executing, but not much else.
>>>> 5. I grepped an hour of logs on the 2 machines for "add merge="; it
>>>> appeared 540 times on the high-load node and 420 times on a normal
>>>> node. I pulled out the size value from the log messages, and the merges
>>>> seemed to be much smaller on the high-load node (rough extraction
>>>> sketch below).
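>>>>
>>>> A rough Python sketch of that count / size extraction (the log path
>>>> comes from the command line here, and the size= pattern may need
>>>> adjusting to the exact infoStream line format):
>>>>
>>>> import re
>>>> import sys
>>>>
>>>> # Count "add merge=" lines in an infoStream log and collect size= values.
>>>> size_re = re.compile(r"size=([\d.]+)\s*MB")
>>>> count = 0
>>>> sizes = []
>>>> with open(sys.argv[1]) as log:   # e.g. python merge_sizes.py node6.log
>>>>     for line in log:
>>>>         if "add merge=" not in line:
>>>>             continue
>>>>         count += 1
>>>>         m = size_re.search(line)
>>>>         if m:
>>>>             sizes.append(float(m.group(1)))
>>>> print("add merge lines:", count)
>>>> if sizes:
>>>>     print("avg %.2f MB, max %.2f MB" % (sum(sizes) / len(sizes), max(sizes)))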
>>>>
>>>> I just created the indices a few days ago, so the shards of each index
>>>> are balanced across the nodes. We have external metrics around document
>>>> ingest rate, and there was no spike during this time period.
>>>>
>>>> Thanks
>>>> Kireet
>>>>
>>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>>
>>>>> It's perfectly normal/healthy for many small merges below the floor
>>>>> size to happen.
>>>>>
>>>>> I think you should first figure out why this node is different from
>>>>> the others. Are you sure it's merging CPU cost that's different?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:
>>>>>
>>>>>> We have a situation where one of the four nodes in our cluster seems
>>>>>> to get caught up endlessly merging. However, it seems to be high CPU
>>>>>> activity and not I/O constrained. I have enabled the IndexWriter
>>>>>> infoStream logs, and often it seems to do merges of quite small
>>>>>> segments (100KB) that are much below the floor size (2MB). I suspect
>>>>>> this is due to frequent refreshes and/or using lots of threads
>>>>>> concurrently to do indexing. Is this true?
>>>>>>
>>>>>> My supposition is that this is leading the merge policy to do lots of
>>>>>> merges of very small segments into another small segment, which will
>>>>>> again require a merge to even reach the floor size. My index has 64
>>>>>> segments and 25 are below the floor size. I am wondering if there
>>>>>> should be an exception to the maxMergesAtOnce parameter for the first
>>>>>> level, so that many small segments could be merged at once in this
>>>>>> case.
>>>>>>
>>>>>> I am considering changing the other parameters (wider tiers, lower
>>>>>> floor size, more concurrent merges allowed), but these all seem to
>>>>>> have side effects I may not necessarily want. Is there a good solution
>>>>>> here?
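>>>>>>
>>>>>> For reference, those knobs correspond to the index.merge.policy.*
>>>>>> settings; a rough sketch of updating them through the index settings
>>>>>> API (the values and index name are purely illustrative, not
>>>>>> recommendations, and depending on the ES version some settings may not
>>>>>> be dynamically updatable):
>>>>>>
>>>>>> import json
>>>>>> import urllib.request
>>>>>>
>>>>>> # Illustrative values only: wider tiers / more segments merged per
>>>>>> # level and a lower floor size. The merge scheduler thread count
>>>>>> # (index.merge.scheduler.max_thread_count) is the concurrency knob,
>>>>>> # but it may not be updatable on a live index in all versions.
>>>>>> settings = {
>>>>>>     "index.merge.policy.segments_per_tier": 20,
>>>>>>     "index.merge.policy.max_merge_at_once": 20,
>>>>>>     "index.merge.policy.floor_segment": "1mb",
>>>>>> }
>>>>>> req = urllib.request.Request(
>>>>>>     "http://localhost:9200/myindex/_settings",   # placeholder index name
>>>>>>     data=json.dumps(settings).encode("utf-8"),
>>>>>>     headers={"Content-Type": "application/json"},
>>>>>>     method="PUT",
>>>>>> )
>>>>>> print(urllib.request.urlopen(req).read().decode("utf-8"))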
