What happens if a ParNew is triggered while CMS is running? Will it wait for the CMS to finish? If so, that would explain our long ParNew above.
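To check, I am hoping to restart one node with GC and safepoint logging switched on, so any overlap shows up directly in the log. A minimal sketch of the flags I have in mind (the log path is just an example, and exact flag support depends on the JVM build):

    -Xloggc:/var/log/cassandra/gc.log      # example path, not our actual config
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintSafepointStatistics
    -XX:PrintSafepointStatisticsCount=1

With date stamps on every collection, a ParNew entry that starts inside a concurrent mark/sweep cycle should show whether the young collection actually waited, and PrintGCApplicationStoppedTime reports the total time threads were stopped, which together with the GC entries helps separate GC pauses from other safepoint stalls.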
Regards,
Joel

2014-02-20 16:29 GMT+01:00 Joel Samuelsson <samuelsson.j...@gmail.com>:

> Hi Frank,
>
> We got a (quite) long GC pause today on 2.0.5:
>
> INFO [ScheduledTasks:1] 2014-02-20 13:51:14,528 GCInspector.java (line 116) GC for ParNew: 1627 ms for 1 collections, 425562984 used; max is 4253024256
> INFO [ScheduledTasks:1] 2014-02-20 13:51:14,542 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3703 ms for 2 collections, 434394920 used; max is 4253024256
>
> Unfortunately it's a production cluster, so I have no additional GC logging enabled. This may be an indication that upgrading is not the (complete) solution.
>
> Regards,
> Joel
>
>
> 2014-02-17 13:41 GMT+01:00 Benedict Elliott Smith <belliottsm...@datastax.com>:
>
>> Hi Ondrej,
>>
>> It's possible you were hit by the problems in this thread before, but it looks like you may have other issues as well. Of course, it may be that on G1 you have one issue and on CMS another, but 27s is extreme even for G1, so that seems unlikely. If you're hitting these pause times on CMS and you get some more output from the safepoint tracing, please do contribute it, as I would love to get to the bottom of that. In the meantime, is it possible you're experiencing paging activity? Have you made certain the VM memory is locked, and preferably that paging is entirely disabled? (The bloom filters and other memory won't be locked, although that shouldn't cause pauses during GC.)
>>
>> Note that mmapped file accesses and other native work shouldn't in any way inhibit GC activity or other safepoint pause times, unless there's a bug in the VM. These threads simply enter a safepoint as they return to the VM execution context, and are considered safe for the duration they are outside it.
>>
>>
>> On 17 February 2014 12:30, Ondřej Černoš <cern...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We tried to switch to G1 because we observed this behaviour on CMS too (and a 27-second pause in G1 is a fairly strong argument against using it). Pauses with CMS were not easily traceable - the JVM stopped even without a stop-the-world pause scheduled (defragmentation, remarking). We thought the time spent reaching a safepoint might have been involved (we saw threads waiting for safepoint resolution) - especially since access to mmapped files is not preemptive, afaik - but that doesn't explain tens of seconds of waiting; even slow IO should read our sstables into memory in much less time. We switched to G1 out of desperation - and to try different code paths - not because we thought it was a great idea. So I think we were hit by the problem discussed in this thread; the G1 report just wasn't very clear, sorry.
>>>
>>> regards,
>>> ondrej
>>>
>>>
>>> On Mon, Feb 17, 2014 at 11:45 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>>>
>>>> Ondrej,
>>>>
>>>> It seems like your issue is much less difficult to diagnose: your collection times are long. At least, the pause you printed the time for is all attributable to the G1 pause.
>>>>
>>>> Note that G1 has not generally performed well with Cassandra in our testing. There are a number of changes going in soon that may change that, but for the time being it is advisable to stick with CMS. With tuning you can no doubt bring your pauses down considerably.
>>>>
>>>>
>>>> On 17 February 2014 10:17, Ondřej Černoš <cern...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> we are seeing the same kind of long pauses in Cassandra.
>>>>> We tried to switch from CMS to G1 without positive result. The stress test is read-heavy: 2 datacenters, 6 nodes, 400 reqs/sec on one datacenter. We see spikes in latency at the 99.99th percentile and higher, caused by threads being stopped in the JVM.
>>>>>
>>>>> The GC in G1 looks like this:
>>>>>
>>>>> {Heap before GC invocations=4073 (full 1):
>>>>>  garbage-first heap   total 8388608K, used 3602914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>>>   region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>>>  compacting perm gen  total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>>>    the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>>> No shared spaces configured.
>>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause) (young)
>>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>>> - age   1:   17213632 bytes,   17213632 total
>>>>> - age   2:   19391208 bytes,   36604840 total
>>>>> , 0.1664300 secs]
>>>>>    [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>>       [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max: 222346218.3, Diff: 0.0]
>>>>>       [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, Sum: 13.7]
>>>>>       [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum: 42.6]
>>>>>          [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: 120]
>>>>>       [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
>>>>>       [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, Sum: 224.6]
>>>>>       [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
>>>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
>>>>>       [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0, Sum: 327.6]
>>>>>       [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max: 222346382.1, Diff: 0.0]
>>>>>    [Code Root Fixup: 0.0 ms]
>>>>>    [Clear CT: 0.4 ms]
>>>>>    [Other: 2.1 ms]
>>>>>       [Choose CSet: 0.0 ms]
>>>>>       [Ref Proc: 1.1 ms]
>>>>>       [Ref Enq: 0.0 ms]
>>>>>       [Free CSet: 0.4 ms]
>>>>>    [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap: 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>>> Heap after GC invocations=4074 (full 1):
>>>>>  garbage-first heap   total 8388608K, used 3090914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>>>   region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>>>  compacting perm gen  total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>>>    the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>>> No shared spaces configured.
>>>>> }
>>>>>  [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>>> 222346.219: G1IncCollectionPause [ 111 0 0 ] [ 0 0 0 0 27586 ] 0
>>>>>
>>>>> And the total time for which application threads were stopped is 27.58 seconds.
>>>>>
>>>>> CMS behaves in a similar manner. We thought it might be the GC waiting for mmapped files being read from disk (the thread cannot reach a safepoint during this operation), but that doesn't explain the huge time.
>>>>>
>>>>> We'll try jHiccup to see if it provides any additional information. The test was done on a mixed aws/openstack environment, openjdk 1.7.0_45, cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.
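>>>>>
>>>>> (In case anyone wants to reproduce the safepoint summary line above: as far as I know it comes from running the JVM with safepoint tracing enabled, along the lines of
>>>>>
>>>>>     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>>>>
>>>>> where the first bracket holds thread counts and the second the spin/block/sync/cleanup/vmop times in ms - so nearly all of the 27.58 s was spent inside the vm operation itself.)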
>>>>>
>>>>> regards,
>>>>>
>>>>> ondrej cernos
>>>>>
>>>>>
>>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng <fnt...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, I have not had a chance to file a JIRA ticket. We have not been able to resolve the issue, but since Joel mentioned that upgrading to Cassandra 2.0.x solved it for them, we may need to upgrade. We are currently on Java 1.7 and Cassandra 1.2.8.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright <kwri...@nanigans.com> wrote:
>>>>>>
>>>>>>> You're running 2.0.* in production? May I ask what C* version and OS? Any hardware details would be appreciated as well. Thx!
>>>>>>>
>>>>>>> From: Joel Samuelsson <samuelsson.j...@gmail.com>
>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>
>>>>>>> We have had similar issues, and upgrading C* to 2.0.x and Java to 1.7 seems to have helped.
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-13 Keith Wright <kwri...@nanigans.com>:
>>>>>>>
>>>>>>>> Frank, did you ever file a ticket for this issue or find the root cause? I believe we are seeing the same issues when attempting to bootstrap.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> From: Robert Coli <rc...@eventbrite.com>
>>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>>
>>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>>>>>>>>
>>>>>>>>> It's possible that this is a JVM issue, but if so there may be some remedial action we can take anyway. There are some more flags we should add, but we can discuss that once you open a ticket. If you could include the strange JMX error as well, that might be helpful.
>>>>>>>>
>>>>>>>> It would be appreciated if you could inform this thread of the JIRA ticket number, for the benefit of the community and google searchers. :)
>>>>>>>>
>>>>>>>> =Rob