They are linked below (node5 is the log of the normal node, node6 is the log of the problematic node).
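In case it's useful, this is roughly how I pulled the merge counts and sizes out of those logs. Treat it as a sketch from memory: the file names and the exact infoStream line format and field names are assumptions.

    # count "add merge=" events (file names are placeholders)
    grep -c 'add merge=' node6.log

    # pull out the reported segment sizes for a rough distribution
    # (assumes the infoStream line carries a "size=... MB" field)
    grep 'add merge=' node6.log | grep -o 'size=[0-9.]* MB' | sort -t= -k2 -n | head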
I don't think it was doing big merges; otherwise, during the high-load period the merges graph line would have had a "floor" > 0, similar to the time period after I disabled refresh.

We don't do routing and use mostly default settings. I think the only settings we changed are:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000

We are running on a 6 CPU/12 core machine with a 32GB heap and 96GB of memory, with 4 spinning disks.

node 5 log (normal): <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
node 6 log (high load): <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>
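For clarity, here is roughly how those overrides look on our side, together with the refresh change I describe in my earlier message quoted below. This is a sketch rather than an exact dump of our config; the dynamic-settings syntax and the index name are assumptions.

    # node-level setting, in elasticsearch.yml (requires a restart):
    #   indices.memory.index_buffer_size: 50%

    # per-index overrides, applied through the index settings API:
    curl -XPUT 'localhost:9200/my_index/_settings' -d '{
      "index.translog.flush_threshold_ops": 50000,
      "index.refresh_interval": "-1"
    }'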
On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>
> Can you post the IndexWriter infoStream output? I can see if anything stands out.
>
> Maybe it was just that this node was doing big merges? I.e., if you waited long enough, the other shards would eventually do their big merges too?
>
> Have you changed any default settings, do custom routing, etc.? Is there any reason to think that the docs that land on this node are "different" in any way?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:
>
>> From all the information I've collected, it seems to be the merging activity:
>>
>> 1. We capture the cluster stats into graphite, and the current merges stat seems to be about 10x higher on this node.
>> 2. The actual node the problem occurs on has happened on different physical machines, so a hardware issue seems unlikely. Once the problem starts it doesn't seem to stop. We have blown away the indices in the past and started indexing again after enabling more logging/stats.
>> 3. I've stopped executing queries, so the only thing happening on the cluster is indexing.
>> 4. Last night when the problem was ongoing, I disabled refresh (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load returned to normal. The merge activity seemed to reduce; it looks like 2 very long-running merges are executing but not much else.
>> 5. I grepped an hour of logs on the 2 machines for "add merge="; the count was 540 on the high-load node and 420 on a normal node. I pulled out the size value from the log message, and the merges seemed to be much smaller on the high-load node.
>>
>> I just created the indices a few days ago, so the shards of each index are balanced across the nodes. We have external metrics around document ingest rate, and there was no spike during this time period.
>>
>> Thanks
>> Kireet
>>
>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>
>>> It's perfectly normal/healthy for many small merges below the floor size to happen.
>>>
>>> I think you should first figure out why this node is different from the others. Are you sure it's merging CPU cost that's different?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:
>>>
>>>> We have a situation where one of the four nodes in our cluster seems to get caught up endlessly merging. However, it seems to be high CPU activity and not I/O constrained. I have enabled the IndexWriter infoStream logs, and often it does merges of quite small segments (100KB) that are much below the floor size (2MB). I suspect this is due to frequent refreshes and/or using lots of threads concurrently to do indexing. Is this true?
>>>>
>>>> My supposition is that this leads the merge policy to do lots of merges of very small segments into another small segment, which will again require a merge to even reach the floor size. My index has 64 segments and 25 are below the floor size. I am wondering if there should be an exception to the maxMergesAtOnce parameter for the first level, so that many small segments could be merged at once in this case.
>>>>
>>>> I am considering changing the other parameters (wider tiers, lower floor size, more concurrent merges allowed), but these all seem to have side effects I may not necessarily want. Is there a good solution here?
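For anyone following along, these are the tiered merge policy knobs I was referring to in the original message quoted above (wider tiers, lower floor size, more merges at once). The values are purely illustrative and the settings-API syntax is an assumption on my part, not something we have applied:

    # illustrative only -- wider tiers, lower floor, more merges per level
    curl -XPUT 'localhost:9200/my_index/_settings' -d '{
      "index.merge.policy.segments_per_tier": 20,
      "index.merge.policy.max_merge_at_once": 20,
      "index.merge.policy.floor_segment": "1mb"
    }'
    # ("more concurrent merges" would be index.merge.scheduler.max_thread_count, not shown)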
