Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 02:10. I included a couple of hours before and after in the logs.
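In case it saves anyone time reproducing the merge counts mentioned further down in the thread, something along these lines against the attached logs should work (the exact file names inside the zips, and the size field pattern in the infoStream output, may differ slightly):

    # count merges registered by IndexWriter during the hour in question
    grep -c "add merge=" node6.log

    # rough distribution of the sizes logged for those merges
    grep "add merge=" node6.log | grep -oE "size=[0-9.]+ ?[KMGT]?B" | sort | uniq -c | sort -rn | head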
On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>
> They are linked below (node5 is the log of the normal node, node6 is the
> log of the problematic node).
>
> I don't think it was doing big merges; otherwise, during the high-load
> period the merges graph line would have had a "floor" > 0, similar to the
> time period after I disabled refresh. We don't do routing and use mostly
> default settings. I think the only settings we changed are:
>
>   indices.memory.index_buffer_size: 50%
>   index.translog.flush_threshold_ops: 50000
>
> We are running on a 6 CPU / 12 core machine with a 32GB heap and 96GB of
> memory with 4 spinning disks.
>
> node 5 log (normal): https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip
> node 6 log (high load): https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip
>
> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>
>> Can you post the IndexWriter infoStream output? I can see if anything
>> stands out.
>>
>> Maybe it was just that this node was doing big merges? I.e., if you
>> waited long enough, the other shards would eventually do their big
>> merges too?
>>
>> Have you changed any default settings, do you use custom routing, etc.?
>> Is there any reason to think that the docs that land on this node are
>> "different" in any way?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>>
>>> From all the information I've collected, it seems to be the merging
>>> activity:
>>>
>>> 1. We capture the cluster stats into graphite, and the current merges
>>>    stat seems to be about 10x higher on this node.
>>> 2. The problem has occurred on different physical machines, so a
>>>    hardware issue seems unlikely. Once the problem starts, it doesn't
>>>    seem to stop. We have blown away the indices in the past and started
>>>    indexing again after enabling more logging/stats.
>>> 3. I've stopped executing queries, so the only thing happening on the
>>>    cluster is indexing.
>>> 4. Last night, while the problem was ongoing, I disabled refresh
>>>    (index.refresh_interval = -1) around 2:10am. Within an hour, the
>>>    load returned to normal. The merge activity seemed to drop; it looks
>>>    like two very long-running merges are executing, but not much else.
>>> 5. I grepped an hour of logs on the two machines for "add merge=": 540
>>>    matches on the high-load node vs. 420 on a normal node. I pulled the
>>>    size value out of the log messages, and the merges seemed to be much
>>>    smaller on the high-load node.
>>>
>>> I just created the indices a few days ago, so the shards of each index
>>> are balanced across the nodes. We have external metrics around document
>>> ingest rate, and there was no spike during this time period.
>>>
>>> Thanks
>>> Kireet
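A side note on step 4 above: disabling refresh was just a dynamic index settings update. Something like the following, applied per index, should reproduce it; setting the value back to the default of "1s" re-enables refresh:

    # "myindex" is a placeholder for the real index name
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index" : { "refresh_interval" : "-1" }
    }'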
>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>>
>>>> It's perfectly normal/healthy for many small merges below the floor
>>>> size to happen.
>>>>
>>>> I think you should first figure out why this node is different from
>>>> the others. Are you sure it's merging CPU cost that's different?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:
>>>>
>>>>> We have a situation where one of the four nodes in our cluster seems
>>>>> to get caught up endlessly merging. However, it seems to be driven by
>>>>> high CPU activity rather than being I/O constrained. I have enabled
>>>>> the IndexWriter infoStream logs, and often it seems to do merges of
>>>>> quite small segments (100KB) that are much below the floor size (2MB).
>>>>> I suspect this is due to frequent refreshes and/or using lots of
>>>>> threads concurrently to do indexing. Is this true?
>>>>>
>>>>> My supposition is that this is leading the merge policy to do lots of
>>>>> merges of very small segments into another small segment, which will
>>>>> again require a merge to even reach the floor size. My index has 64
>>>>> segments and 25 are below the floor size. I am wondering if there
>>>>> should be an exception to the maxMergesAtOnce parameter for the first
>>>>> level, so that many small segments could be merged at once in this
>>>>> case.
>>>>>
>>>>> I am considering changing the other parameters (wider tiers, lower
>>>>> floor size, more concurrent merges allowed), but these all seem to
>>>>> have side effects I may not necessarily want. Is there a good
>>>>> solution here?
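To make the options in my original question concrete, the knobs I'm weighing correspond to the index settings below. The values are purely illustrative (not something I've applied), and I believe they can be changed on a live index via the settings API; worst case they can go into elasticsearch.yml and take effect on restart:

    # illustrative values only: wider tiers (segments_per_tier), larger
    # first-level merges (max_merge_at_once), a lower floor (floor_segment),
    # and more concurrent merges (scheduler max_thread_count)
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index.merge.policy.segments_per_tier": 20,
      "index.merge.policy.max_merge_at_once": 20,
      "index.merge.policy.floor_segment": "1mb",
      "index.merge.scheduler.max_thread_count": 4
    }'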
