Another question: have you disabled merge throttling? And which version of ES are you using?

Mike McCandless
http://blog.mikemccandless.com
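(For reference: on an Elasticsearch 1.x cluster of this era, checking the version and disabling the store-level merge throttle would look roughly like the sketch below. The host, port, and values are illustrative; indices.store.throttle.type and indices.store.throttle.max_bytes_per_sec are the relevant dynamic cluster settings.)

    # Report the ES version the node is running (host/port are illustrative).
    curl -s 'http://localhost:9200/?pretty'

    # Disable store-level merge throttling cluster-wide.
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": { "indices.store.throttle.type": "none" }
    }'

    # Or raise the throttle ceiling instead of removing it entirely.
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": { "indices.store.throttle.max_bytes_per_sec": "100mb" }
    }'

The default throttle of that era (20 MB/s) is easy to saturate on spinning disks under heavy indexing, which is presumably why the question comes up.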
On Thu, Jul 10, 2014 at 5:49 AM, Michael McCandless <[email protected]> wrote:

Indeed the hot threads on node5 didn't reveal anything unexpected: they are busy merging and refreshing. One thread was in SmartChineseAnalyzer...

Very strange that young gen GC is suddenly so costly. This is much more time spent in GC than you ever see on healthy nodes?

If you restart the JVM on that node, does the problem immediately start happening again? (Hmm: does restarting the JVM cause the others to start building new replicas... which would mess up this experiment.) Separately, if you reboot the node and bring it back up, does the problem start happening again?

Does vmstat/iostat reveal anything? E.g. is the OS doing a lot of swapping or something? Do you mlockall your JVM heap (bootstrap.mlockall)? What is swappiness set to?

Which Java / OS versions are you using?

Mike McCandless
http://blog.mikemccandless.com


On Wed, Jul 9, 2014 at 5:13 PM, Kireet Reddy <[email protected]> wrote:

Sorry, here it is:

https://www.dropbox.com/sh/3s6m0bhz4eshi6m/AAABnRXFCLrCne-GLG1zvQP3a

Also a couple of graphs of the memory usage.


On Wednesday, July 9, 2014 2:10:49 PM UTC-7, Michael McCandless wrote:

Hmm, the link doesn't seem to work?

Mike McCandless
http://blog.mikemccandless.com


On Wed, Jul 9, 2014 at 4:06 PM, Kireet Reddy <[email protected]> wrote:

Sorry, forgot the link:

https://www.dropbox.com/sh/3s6m0bhz4eshi6m/AAD8g3Ukq1UW0IbPV-a-CrBGa/1229.txt


On Wednesday, July 9, 2014 1:05:56 PM UTC-7, Kireet Reddy wrote:

The problem is happening again, this time on node 5. I have captured a few hot thread requests here. I also included one from node 6 (which is now fine). There are merge-related stacks, but it seems like everything is taking a lot more CPU than usual. I did a few type=wait and type=block dumps and the result was always 0% usage there. Also, young gen GC activity has again gone crazy (old gen / heap size seems fine). Would hot thread measurements be distorted if GC activity is very high?

It seems strange to me that this would only happen on one node while we have replicas set to at least 1 for all our indices. It seems like the problem should happen on a couple of nodes simultaneously.

--Kireet


On Monday, July 7, 2014 3:51:35 PM UTC-7, Michael McCandless wrote:

Could you pull all hot threads next time the problem happens?

Mike McCandless
http://blog.mikemccandless.com
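(For reference: the hot-thread captures discussed above come from the nodes hot threads API. A minimal sketch, assuming a node named node5 and the default HTTP port; the parameter values are illustrative.)

    # All nodes, CPU view (the default type).
    curl -s 'http://localhost:9200/_nodes/hot_threads'

    # One node, sampling more threads over a longer interval.
    curl -s 'http://localhost:9200/_nodes/node5/hot_threads?threads=10&interval=1000ms'

    # The type=wait / type=block dumps mentioned above.
    curl -s 'http://localhost:9200/_nodes/node5/hot_threads?type=wait'
    curl -s 'http://localhost:9200/_nodes/node5/hot_threads?type=block'

    # OS-level checks Mike asks about: swapping, swappiness, and whether the heap is locked.
    vmstat 5
    iostat -x 5
    cat /proc/sys/vm/swappiness
    curl -s 'http://localhost:9200/_nodes/process?pretty'   # look for "mlockall" : true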
On Mon, Jul 7, 2014 at 3:47 PM, Kireet Reddy <[email protected]> wrote:

All that seems correct (except I think this is for node 6, not node 5). We don't delete documents, but we do some updates. The vast majority of documents get indexed into the large shards, but the smaller ones take some writes as well.

We aren't using virtualized hardware, and Elasticsearch is the only thing running on the machines: no scheduled jobs, etc. I don't think something is interfering; actually, overall disk I/O rate and operations on the machine go down quite a bit during the problematic period, which is consistent with your observations about things taking longer.

I went back and checked all our collected metrics again. I noticed that even though the heap usage and GC count seem smooth during the period in question, GC time spent goes way up. Also, active indexing threads go up, but since our ingest rate didn't go up I assumed this was a side effect. During a previous occurrence a few days ago on node5, I stopped all indexing activity for 15 minutes. Active merges and indexing requests went to zero as expected. Then I re-enabled indexing, and immediately CPU, GC, and active merges went back up to the problematic rates.

Overall this is pretty confusing to me as to what is a symptom vs. a root cause here. A summary of what I think I know:

1. Every few days, CPU usage on a node goes way above the other nodes and doesn't recover. We've let the node run in the elevated CPU state for a day with no improvement.
2. It doesn't seem likely that it's data related. We use replicas=1 and no other nodes have issues.
3. It doesn't seem hardware related. We run on dedicated h/w with Elasticsearch being the only thing running. Also, the problem appears on various nodes, and machine load seems tied directly to the Elasticsearch process.
4. During the problematic period: CPU usage, active merge threads, active bulk (indexing) threads, and GC time are elevated.
5. During the problematic period: I/O ops and I/O throughput decrease.
6. Overall heap usage seems to increase smoothly; the extra GC time seems to be spent on the new gen. Interestingly, the GC count didn't seem to increase.
7. In the hours beforehand, GC behavior of the problematic node was similar to the other nodes.
8. If I pause indexing, machine load quickly returns to normal, and merges and indexing requests complete. If I then restart indexing, the problem reoccurs immediately.
9. If I disable automatic refreshes, the problem disappears within an hour or so.
10. Hot threads show merging activity as the hot threads.

The first few points make me think the increased active merges is perhaps a side effect, but then the last three make me think merging is the root cause. The only additional things I can think of that may be relevant are:

1. Our documents can vary greatly in size; they average a couple KB but can rarely be several MB.
2. We do use language analysis plugins; perhaps one of these is acting up?
3. We eagerly load one field into the field data cache. But the cache size is ok and the overall heap behavior is ok, so I don't think this is the problem.

That's a lot of information, but I am not sure where to go next from here...
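(For reference: most of the numbers in that summary are exposed by the stats APIs, so they can be graphed alongside the external ingest metrics. A minimal sketch, assuming the default HTTP port; the metric groups shown are the standard 1.x ones.)

    # Per-node GC time/count, merge activity, and indexing/bulk thread pool usage.
    curl -s 'http://localhost:9200/_nodes/stats/indices,jvm,thread_pool?pretty'

    # Per-index totals for current merges, refresh time, and indexing rate.
    curl -s 'http://localhost:9200/_stats/merge,refresh,indexing?pretty'

In the nodes output, merges.current and the young-generation collector times in the jvm section map most directly to points 4 and 6 above.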
On Monday, July 7, 2014 8:23:20 AM UTC-7, Michael McCandless wrote:

Indeed there are no big merges during that time...

I can see on node5, ~14:45, suddenly merges are taking a long time, refresh is taking much longer (4-5 seconds instead of < 0.4 sec), commit time goes up from < 0.5 sec to ~1-2 sec, etc., but other metrics are fine, e.g. total merging GB and the number of commits/refreshes are very low during this time.

Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards. The biggish shards are indexing at a very slow rate and only have ~1% deletions. Are you explicitly deleting docs?

I suspect something is suddenly cutting into the IO perf of this box, and because merging/refreshing is so IO intensive, it causes these operations to run slower / backlog.

Are there any scheduled jobs, e.g. backups/snapshots, that start up? Are you running on virtualized hardware?

Mike McCandless
http://blog.mikemccandless.com


On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy <[email protected]> wrote:

Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 02:10. I included a couple hours before and after in the logs.


On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:

They are linked below (node5 is the log of the normal node, node6 is the log of the problematic node).

I don't think it was doing big merges; otherwise, during the high load period, the merges graph line would have had a "floor" > 0, similar to the time period after I disabled refresh. We don't do routing and use mostly default settings. I think the only settings we changed are:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000

We are running on 6-CPU/12-core machines with a 32 GB heap, 96 GB of memory, and 4 spinning disks.

node 5 log (normal): https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip
node 6 log (high load): https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip


On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:

Can you post the IndexWriter infoStream output? I can see if anything stands out.

Maybe it was just that this node was doing big merges? I.e., if you waited long enough, the other shards would eventually do their big merges too?

Have you changed any default settings, do custom routing, etc.? Is there any reason to think that the docs that land on this node are "different" in any way?

Mike McCandless
http://blog.mikemccandless.com
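(For reference: the shard layout Mike describes, two ~17 GB shards plus ~50 tiny shards per node, and the per-segment sizes can be confirmed directly from the API. A minimal sketch; the index name is illustrative.)

    # Shard sizes and their placement across nodes.
    curl -s 'http://localhost:9200/_cat/shards?v'

    # Per-segment detail (size, docs, deleted docs) for one index; useful for
    # counting segments that sit below the 2 MB merge floor.
    curl -s 'http://localhost:9200/myindex/_segments?pretty'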
On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:

From all the information I’ve collected, it seems to be the merging activity:

1. We capture the cluster stats into graphite, and the current merges stat seems to be about 10x higher on this node.
2. The problem has occurred on different physical machines, so a h/w issue seems unlikely. Once the problem starts it doesn't seem to stop. We have blown away the indices in the past and started indexing again after enabling more logging/stats.
3. I've stopped executing queries, so the only thing that's happening on the cluster is indexing.
4. Last night when the problem was ongoing, I disabled refresh (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load returned to normal. The merge activity seemed to reduce; it seems like 2 very long running merges are executing but not much else.
5. I grepped an hour of logs of the 2 machines for "add merge=": it was 540 on the high load node and 420 on a normal node. I pulled out the size value from the log message, and the merges seemed to be much smaller on the high load node.

I just created the indices a few days ago, so the shards of each index are balanced across the nodes. We have external metrics around document ingest rate, and there was no spike during this time period.

Thanks
Kireet


On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:

It's perfectly normal/healthy for many small merges below the floor size to happen.

I think you should first figure out why this node is different from the others. Are you sure it's merging CPU cost that's different?

Mike McCandless
http://blog.mikemccandless.com


On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:

We have a situation where one of the four nodes in our cluster seems to get caught up endlessly merging. However, it seems to be high CPU activity and not I/O constrained. I have enabled the IndexWriter infoStream logs, and oftentimes it seems to do merges of quite small segments (100 KB) that are much below the floor size (2 MB). I suspect this is due to frequent refreshes and/or using lots of threads concurrently to do indexing. Is this true?

My supposition is that this is leading to the merge policy doing lots of merges of very small segments into another small segment, which will again require a merge to even reach the floor size. My index has 64 segments and 25 are below the floor size. I am wondering if there should be an exception for the maxMergesAtOnce parameter for the first level, so that many small segments could be merged at once in this case.

I am considering changing the other parameters (wider tiers, lower floor size, more concurrent merges allowed), but these all seem to have side effects I may not necessarily want. Is there a good solution here?
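(For reference: the knobs weighed in that last paragraph correspond to per-index settings in 1.x. A minimal sketch of adjusting them; the index name and values are illustrative, not recommendations.)

    # Lower the floor and let more small segments merge in one pass
    # (keeping segments_per_tier at or above max_merge_at_once is the usual guidance).
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index.merge.policy.floor_segment": "1mb",
      "index.merge.policy.segments_per_tier": 20,
      "index.merge.policy.max_merge_at_once": 20
    }'

    # The refresh experiment described earlier: turn automatic refresh off, then back on.
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index.refresh_interval": "-1"}'
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index.refresh_interval": "1s"}'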
