Hmm link doesn't seem to work?

Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 9, 2014 at 4:06 PM, Kireet Reddy <[email protected]> wrote:

Sorry, forgot the link:
https://www.dropbox.com/sh/3s6m0bhz4eshi6m/AAD8g3Ukq1UW0IbPV-a-CrBGa/1229.txt

On Wednesday, July 9, 2014 1:05:56 PM UTC-7, Kireet Reddy wrote:

The problem is happening again, this time on node 5. I have captured a few hot thread requests here. I also included one from node 6 (which is now fine). There are merge-related stacks, but it seems like everything is taking a lot more cpu than usual. I did a few type=wait and type=block dumps and the result was always 0% usage there. Also, young gen gc activity has again gone crazy (old gen/heap size seems fine). Would hot thread measurements be distorted if gc activity is very high?

It seems strange to me that this would only happen on one node while we have replicas set to at least 1 for all our indices. It seems like the problem should happen on a couple of nodes simultaneously.

--Kireet

On Monday, July 7, 2014 3:51:35 PM UTC-7, Michael McCandless wrote:

Could you pull all hot threads next time the problem happens?

Mike McCandless
http://blog.mikemccandless.com
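For reference, the hot thread dumps being traded here come from the nodes hot threads API. A minimal sketch of pulling them, assuming a stock 1.x cluster reachable on localhost:9200 (the node name node5, the thread count, and the sampling interval are placeholders, not values from the thread):

    # Hottest threads by cpu across all nodes (type=cpu is the default):
    curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=1s'

    # Restricted to one node, sampling wait and block states instead --
    # these correspond to the type=wait / type=block dumps mentioned above:
    curl -s 'localhost:9200/_nodes/node5/hot_threads?type=wait'
    curl -s 'localhost:9200/_nodes/node5/hot_threads?type=block'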
On Mon, Jul 7, 2014 at 3:47 PM, Kireet Reddy <[email protected]> wrote:

All that seems correct (except I think this is for node 6, not node 5). We don't delete documents, but we do some updates. The vast majority of documents get indexed into the large shards, but the smaller ones take some writes as well.

We aren't using virtualized hardware and elasticsearch is the only thing running on the machines, no scheduled jobs, etc. I don't think something is interfering; actually, overall disk i/o rate and operations on the machine go down quite a bit during the problematic period, which is consistent with your observations about things taking longer.

I went back and checked all our collected metrics again. I noticed that even though the heap usage and gc count seem smooth during the period in question, gc time spent goes way up. Also, active indexing threads go up, but since our ingest rate didn't go up I assumed this was a side effect. During a previous occurrence a few days ago on node5, I stopped all indexing activity for 15 minutes. Active merges and indexing requests went to zero as expected. Then I re-enabled indexing and immediately cpu/gc/active merges went back up to the problematic rates.

Overall this is pretty confusing to me as to what is a symptom vs. a root cause here. A summary of what I think I know:

1. Every few days, cpu usage on a node goes way above the other nodes and doesn't recover. We've let the node run in the elevated cpu state for a day with no improvement.
2. It doesn't seem likely that it's data related. We use replicas=1 and no other nodes have issues.
3. It doesn't seem hardware related. We run on dedicated h/w with elasticsearch being the only thing running. Also, the problem appears on various nodes and machine load seems tied directly to the elasticsearch process.
4. During the problematic period: cpu usage, active merge threads, active bulk (indexing) threads, and gc time are elevated.
5. During the problematic period: i/o ops and i/o throughput decrease.
6. Overall heap usage seems to increase smoothly; the extra gc time seems to be spent on the new gen. Interestingly, the gc count didn't seem to increase.
7. In the hours beforehand, gc behavior of the problematic node was similar to the other nodes.
8. If I pause indexing, machine load quickly returns to normal and merges and indexing requests complete. If I then restart indexing, the problem reoccurs immediately.
9. If I disable automatic refreshes, the problem disappears within an hour or so.
10. Hot threads show merging activity as the hot threads.

The first few points make me think the increased active merges are perhaps a side effect, but the last 3 make me think merging is the root cause. The only additional things I can think of that may be relevant are:

1. Our documents can vary greatly in size: they average a couple KB but can rarely be several MB.
2. We do use language analysis plugins; perhaps one of these is acting up?
3. We eagerly load one field into the field data cache. But the cache size is ok and the overall heap behavior is ok, so I don't think this is the problem.

That's a lot of information, but I am not sure where to go next from here...

On Monday, July 7, 2014 8:23:20 AM UTC-7, Michael McCandless wrote:

Indeed there are no big merges during that time...

I can see that on node5, around 14:45, suddenly merges are taking a long time, refresh is taking much longer (4-5 seconds instead of < 0.4 sec), commit time goes up from < 0.5 sec to ~1-2 sec, etc., but other metrics are fine, e.g. total merging GB; the number of commits/refreshes is very low during this time.

Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards. The biggish shards are indexing at a very slow rate and only have ~1% deletions. Are you explicitly deleting docs?

I suspect something is suddenly cutting into the IO perf of this box, and because merging/refreshing is so IO intensive, it causes these operations to run slower / backlog.

Are there any scheduled jobs, e.g. backups/snapshots, that start up? Are you running on virtualized hardware?

Mike McCandless
http://blog.mikemccandless.com

On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy <[email protected]> wrote:

Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 02:10. I included a couple hours before and after in the logs.

On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:

They are linked below (node5 is the log of the normal node, node6 is the log of the problematic node).

I don't think it was doing big merges, otherwise during the high load period the merges graph line would have had a "floor" > 0, similar to the time period after I disabled refresh. We don't do routing and use mostly default settings. I think the only settings we changed are:

    indices.memory.index_buffer_size: 50%
    index.translog.flush_threshold_ops: 50000

We are running on 6 cpu/12 core machines with a 32GB heap, 96GB of memory, and 4 spinning disks.

node 5 log (normal): https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip
node 6 log (high load): https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip
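As an aside, for readers without infoStream enabled: per-node and per-index merge/refresh counters similar to the numbers quoted above are exposed by the 1.x stats APIs, which is also where graphite-style dashboards usually pull them from. A rough sketch (localhost:9200 is a placeholder; the idea is to compare indices.merges.current, indices.merges.total_time_in_millis, and indices.refresh.total_time_in_millis between the suspect node and a healthy one):

    # Per-node merge/refresh/flush counters:
    curl -s 'localhost:9200/_nodes/stats/indices?pretty'

    # The same counters broken out per index:
    curl -s 'localhost:9200/_stats/merge,refresh,flush?pretty'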
On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:

Can you post the IndexWriter infoStream output? I can see if anything stands out.

Maybe it was just that this node was doing big merges? I.e., if you waited long enough, the other shards would eventually do their big merges too?

Have you changed any default settings, do you do custom routing, etc.? Is there any reason to think that the docs that land on this node are "different" in any way?

Mike McCandless
http://blog.mikemccandless.com

On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <[email protected]> wrote:

From all the information I've collected, it seems to be the merging activity:

1. We capture the cluster stats into graphite and the current merges stat seems to be about 10x higher on this node.
2. The problem has occurred on different physical machines, so a h/w issue seems unlikely. Once the problem starts it doesn't seem to stop. We have blown away the indices in the past and started indexing again after enabling more logging/stats.
3. I've stopped executing queries, so the only thing that's happening on the cluster is indexing.
4. Last night when the problem was ongoing, I disabled refresh (index.refresh_interval = -1) around 2:10am. Within 1 hour, the load returned to normal. The merge activity seemed to reduce; it seems like 2 very long running merges are executing but not much else.
5. I grepped an hour of logs of the 2 machines for "add merge=": it was 540 on the high load node and 420 on a normal node. I pulled out the size value from the log message and the merges seemed to be much smaller on the high load node.

I just created the indices a few days ago, so the shards of each index are balanced across the nodes. We have external metrics around document ingest rate and there was no spike during this time period.

Thanks
Kireet
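For completeness, the refresh change described in point 4 is a dynamic index setting; a minimal sketch, assuming a 1.x cluster on localhost:9200 and using "myindex" as a placeholder index name:

    # Turn off automatic refresh (the index.refresh_interval = -1 change above):
    curl -XPUT 'localhost:9200/myindex/_settings' -d '{
      "index": { "refresh_interval": "-1" }
    }'

    # Restore the default 1s refresh once the experiment is done:
    curl -XPUT 'localhost:9200/myindex/_settings' -d '{
      "index": { "refresh_interval": "1s" }
    }'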
On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:

It's perfectly normal/healthy for many small merges below the floor size to happen.

I think you should first figure out why this node is different from the others. Are you sure it's merging CPU cost that's different?

Mike McCandless
http://blog.mikemccandless.com

On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <[email protected]> wrote:

We have a situation where one of the four nodes in our cluster seems to get caught up endlessly merging. However, it seems to be high CPU activity and not I/O constrained. I have enabled the IndexWriter infoStream logs, and often it seems to do merges of quite small segments (100KB) that are much below the floor size (2MB). I suspect this is due to frequent refreshes and/or using lots of threads concurrently to do indexing. Is this true?

My supposition is that this is leading to the merge policy doing lots of merges of very small segments into another small segment, which will again require a merge to even reach the floor size. My index has 64 segments and 25 are below the floor size. I am wondering if there should be an exception to the maxMergeAtOnce parameter for the first level, so that many small segments could be merged at once in this case.

I am considering changing the other parameters (wider tiers, lower floor size, more concurrent merges allowed), but these all seem to have side effects I may not necessarily want. Is there a good solution here?
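For readers wondering where these knobs live: on an Elasticsearch 1.x index, the tiered merge policy parameters discussed here (floor size, segments per tier, merges at once) are exposed as index.merge.policy.* settings. A rough sketch with purely illustrative values, not a recommendation ("myindex" is again a placeholder, and depending on the exact 1.x version some of these may only take effect at index creation rather than via a live settings update):

    curl -XPUT 'localhost:9200/myindex/_settings' -d '{
      "index.merge.policy.floor_segment": "1mb",
      "index.merge.policy.segments_per_tier": 20,
      "index.merge.policy.max_merge_at_once": 20
    }'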
