Re: Intermittent long application pauses on nodes

2014-10-31 Thread graham sanderson
I have to admit that I haven’t tried SafepointTimeout myself (I only noticed that 
it was actually a production VM option in the JVM code after making my initial 
suggestions below for debugging without it).

There doesn’t seem to be an obvious bug in SafepointTimeout, though I may not 
be looking at the same version of the JVM source. That said, there IS some code 
executed after waiting (up to the timeout) for threads to reach the safe point and 
before the sync time is actually recorded, so maybe that code is slow for some 
reason. If you were running with TraceSafepoint you’d see this (but that 
requires a debug build of the JVM), because there is an extra trace statement 
there.
 
I’d still try some of the suggestions below (though if what I said above is the 
case they may not help you as much, since the other threads may be well behaved; 
they may still give you some insight). But first, for sure, I’d check whether 
SafepointTimeout is working at all by setting a ridiculously low timeout delay 
and seeing if you catch anything.
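
Something like this is what I mean - a minimal sketch reusing the flags you 
already have, where the 50ms delay is deliberately absurd and only there to 
verify the mechanism fires at all (tune it back up once you know it works):

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=50"

With a delay that low even ordinary safepoints should trip the timeout output, 
so if nothing shows up at all you know the reporting isn’t working for you, 
rather than the syncs being short.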

> On Oct 31, 2014, at 2:42 PM, Dan van Kley  wrote:
> 
> Well I tried the SafepointTimeout option, but unfortunately it seems like the 
> long safepoint syncs don't actually trigger the SafepointTimeout mechanism, 
> so we didn't get any logs on it. It's possible I'm just doing it wrong, I 
> used the following options:
> 
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions 
> -XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
> JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
> JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
> JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=4000"
> JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout"
> 
> and saw the safepoint logging as usual in that stdout.log file, but no 
> timeout logging in either that file or the GC log when safepoint syncs 
> exceeded the timeout. It also seems possible that SafepointTimeout doesn't 
> work on long syncs, see 
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html
>  
> .
> 
> That being the case, any other ideas or suggestions would be appreciated. 
> Thanks!
> 
> On Mon, Oct 27, 2014 at 9:44 AM, Dan van Kley wrote:
> Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try and 
> see if that gives us anything to act on.
> 
> On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson wrote:
> And -XX:SafepointTimeoutDelay=xxx
> 
> to set how long before it dumps output (defaults to 10000 I believe)…
> 
> Note it doesn’t actually timeout by default, it just prints the problematic 
> threads after that time and keeps on waiting
> 
>> On Oct 24, 2014, at 2:44 PM, graham sanderson wrote:
>> 
>> Actually - there is 
>> 
>> -XX:+SafepointTimeout
>> 
>> which will print out offending threads (assuming you reach a 10 second 
>> pause)…
>> 
>> That is probably your best bet.
>> 
>>> On Oct 24, 2014, at 2:38 PM, graham sanderson wrote:
>>> 
>>> This certainly sounds like a JVM bug.
>>> 
>>> We are running C* 2.0.9 on pretty high end machines with pretty large 
>>> heaps, and don’t seem to have seen this (note we are on 7u67, so that might 
>>> be an interesting data point, though since the old thread predated that 
>>> probably not)
>>> 
>>> 1) From the app/java side, I’d obviously see if you can identify anything 
>>> which always coincides with this - repair, compaction etc
>>> 2) From the VM side (given that this as Benedict mentioned) some threads 
>>> are taking a long time to rendezvous at the safe point, and it is probably 
>>> not application threads, I’d look what GC threads, compiler threads etc 
>>> might be doing. As mentioned it shouldn’t be anything to do with operations 
>>> which run at a safe point anyway (e.g. scavenge)
>>> a) So look at what CMS is doing at the time and see if you can correlate
>>> b) Check Oracle for related bugs - didn’t obviously see any, but there 
>>> have been some complaints related to compilation and safe points
>>> c) Add any compilation tracing you can
>>> d) Kind of important here - see if you can figure out via dtrace, 
>>> system tap, gdb or whatever, what the threads are doing when this happens. 
>>> Sadly it doesn’t look like you can figure out when this is happening (until 
>>> afterwards) unless you have access to a debug JVM build (and can turn on 
>>> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
>>> update within a time period) - if you don’t have access to that, I guess 
>>> you could try and get a dump every 2-3 seconds (you should catch a 9 second 
>>> pause eventually!)
>>> 
>>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley wrote:
 
>>>> I'm also curious to know if this was ever resolved or if there's any other 
>>>> recommended steps to take to continue to track it down.

Re: Intermittent long application pauses on nodes

2014-10-31 Thread Dan van Kley
Well I tried the SafepointTimeout option, but unfortunately it seems like
the long safepoint syncs don't actually trigger the SafepointTimeout
mechanism, so we didn't get any logs on it. It's possible I'm just doing it
wrong, I used the following options:

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
-XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=4000"
JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout"

and saw the safepoint logging as usual in that stdout.log file, but no
timeout logging in either that file or the GC log when safepoint syncs
exceeded the timeout. It also seems possible that SafepointTimeout doesn't
work on long syncs, see
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html
.

That being the case, any other ideas or suggestions would be appreciated.
Thanks!

On Mon, Oct 27, 2014 at 9:44 AM, Dan van Kley 
wrote:

> Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
> and see if that gives us anything to act on.
>
> On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson  wrote:
>
>> And -XX:SafepointTimeoutDelay=xxx
>>
>> to set how long before it dumps output (defaults to 10000 I believe)…
>>
>> Note it doesn’t actually timeout by default, it just prints the
>> problematic threads after that time and keeps on waiting
>>
>> On Oct 24, 2014, at 2:44 PM, graham sanderson  wrote:
>>
>> Actually - there is
>>
>> -XX:+SafepointTimeout
>>
>> which will print out offending threads (assuming you reach a 10 second
>> pause)…
>>
>> That is probably your best bet.
>>
>> On Oct 24, 2014, at 2:38 PM, graham sanderson  wrote:
>>
>> This certainly *sounds* like a JVM bug.
>>
>> We are running C* 2.0.9 on pretty high end machines with pretty large
>> heaps, and don’t seem to have seen this (note we are on 7u67, so that might
>> be an interesting data point, though since the old thread predated that
>> probably not)
>>
>> 1) From the app/java side, I’d obviously see if you can identify anything
>> which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that this as Benedict mentioned) some threads
>> are taking a long time to rendezvous at the safe point, and it is probably
>> not application threads, I’d look what GC threads, compiler threads etc
>> might be doing. As mentioned it shouldn’t be anything to do with operations
>> which run at a safe point anyway (e.g. scavenge)
>> a) So look at what CMS is doing at the time and see if you can correlate
>> b) Check Oracle for related bugs - didn’t obviously see any, but there
>> have been some complaints related to compilation and safe points
>> c) Add any compilation tracing you can
>> d) Kind of important here - see if you can figure out via dtrace, system
>> tap, gdb or whatever, what the threads are doing when this happens. Sadly
>> it doesn’t look like you can figure out when this is happening (until
>> afterwards) unless you have access to a debug JVM build (and can turn on
>> -XX:+TraceSafepoint and look for a safe point start without a corresponding
>> update within a time period) - if you don’t have access to that, I guess
>> you could try and get a dump every 2-3 seconds (you should catch a 9 second
>> pause eventually!)
>>
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley 
>> wrote:
>>
>> I'm also curious to know if this was ever resolved or if there's any
>> other recommended steps to take to continue to track it down. I'm seeing
>> the same issue in our production cluster, which is running Cassandra 2.0.10
>> and JVM 1.7u71, using the CMS collector. Just as described above, the issue
>> is long "Total time for which application threads were stopped" pauses that
>> are not a direct result of GC pauses (ParNew, initial mark or remark). When
>> I enabled the safepoint logging I saw the same result, long "sync" pause
>> times with short spin and block times, usually with the "RevokeBias"
>> description. We're seeing pause times sometimes in excess of 10 seconds, so
>> it's a pretty debilitating issue. Our machines are not swapping (or even
>> close to it) or having other load issues when these pauses occur. Any ideas
>> would be very appreciated. Thanks!
>>
>>
>>
>>
>>
>


Re: Intermittent long application pauses on nodes

2014-10-27 Thread Dan van Kley
Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
and see if that gives us anything to act on.

On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson  wrote:

> And -XX:SafepointTimeoutDelay=xxx
>
> to set how long before it dumps output (defaults to 10000 I believe)…
>
> Note it doesn’t actually timeout by default, it just prints the
> problematic threads after that time and keeps on waiting
>
> On Oct 24, 2014, at 2:44 PM, graham sanderson  wrote:
>
> Actually - there is
>
> -XX:+SafepointTimeout
>
> which will print out offending threads (assuming you reach a 10 second
> pause)…
>
> That is probably your best bet.
>
> On Oct 24, 2014, at 2:38 PM, graham sanderson  wrote:
>
> This certainly *sounds* like a JVM bug.
>
> We are running C* 2.0.9 on pretty high end machines with pretty large
> heaps, and don’t seem to have seen this (note we are on 7u67, so that might
> be an interesting data point, though since the old thread predated that
> probably not)
>
> 1) From the app/java side, I’d obviously see if you can identify anything
> which always coincides with this - repair, compaction etc
> 2) From the VM side (given that this as Benedict mentioned) some threads
> are taking a long time to rendezvous at the safe point, and it is probably
> not application threads, I’d look what GC threads, compiler threads etc
> might be doing. As mentioned it shouldn’t be anything to do with operations
> which run at a safe point anyway (e.g. scavenge)
> a) So look at what CMS is doing at the time and see if you can correlate
> b) Check Oracle for related bugs - didn’t obviously see any, but there
> have been some complaints related to compilation and safe points
> c) Add any compilation tracing you can
> d) Kind of important here - see if you can figure out via dtrace, system
> tap, gdb or whatever, what the threads are doing when this happens. Sadly
> it doesn’t look like you can figure out when this is happening (until
> afterwards) unless you have access to a debug JVM build (and can turn on
> -XX:+TraceSafepoint and look for a safe point start without a corresponding
> update within a time period) - if you don’t have access to that, I guess
> you could try and get a dump every 2-3 seconds (you should catch a 9 second
> pause eventually!)
>
> On Oct 24, 2014, at 12:35 PM, Dan van Kley 
> wrote:
>
> I'm also curious to know if this was ever resolved or if there's any other
> recommended steps to take to continue to track it down. I'm seeing the same
> issue in our production cluster, which is running Cassandra 2.0.10 and JVM
> 1.7u71, using the CMS collector. Just as described above, the issue is long
> "Total time for which application threads were stopped" pauses that are not
> a direct result of GC pauses (ParNew, initial mark or remark). When I
> enabled the safepoint logging I saw the same result, long "sync" pause
> times with short spin and block times, usually with the "RevokeBias"
> description. We're seeing pause times sometimes in excess of 10 seconds, so
> it's a pretty debilitating issue. Our machines are not swapping (or even
> close to it) or having other load issues when these pauses occur. Any ideas
> would be very appreciated. Thanks!
>
>
>
>
>


Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
And -XX:SafepointTimeoutDelay=xxx

to set how long before it dumps output (defaults to 10000 I believe)…

Note it doesn’t actually timeout by default, it just prints the problematic 
threads after that time and keeps on waiting

> On Oct 24, 2014, at 2:44 PM, graham sanderson  wrote:
> 
> Actually - there is 
> 
> -XX:+SafepointTimeout
> 
> which will print out offending threads (assuming you reach a 10 second pause)…
> 
> That is probably your best bet.
> 
>> On Oct 24, 2014, at 2:38 PM, graham sanderson wrote:
>> 
>> This certainly sounds like a JVM bug.
>> 
>> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
>> and don’t seem to have seen this (note we are on 7u67, so that might be an 
>> interesting data point, though since the old thread predated that probably 
>> not)
>> 
>> 1) From the app/java side, I’d obviously see if you can identify anything 
>> which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that this as Benedict mentioned) some threads are 
>> taking a long time to rendezvous at the safe point, and it is probably not 
>> application threads, I’d look what GC threads, compiler threads etc might be 
>> doing. As mentioned it shouldn’t be anything to do with operations which run 
>> at a safe point anyway (e.g. scavenge)
>>  a) So look at what CMS is doing at the time and see if you can correlate
>>  b) Check Oracle for related bugs - didn’t obviously see any, but there 
>> have been some complaints related to compilation and safe points
>>  c) Add any compilation tracing you can
>>  d) Kind of important here - see if you can figure out via dtrace, 
>> system tap, gdb or whatever, what the threads are doing when this happens. 
>> Sadly it doesn’t look like you can figure out when this is happening (until 
>> afterwards) unless you have access to a debug JVM build (and can turn on 
>> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
>> update within a time period) - if you don’t have access to that, I guess you 
>> could try and get a dump every 2-3 seconds (you should catch a 9 second 
>> pause eventually!)
>> 
>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley wrote:
>>> 
>>> I'm also curious to know if this was ever resolved or if there's any other 
>>> recommended steps to take to continue to track it down. I'm seeing the same 
>>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>>> "Total time for which application threads were stopped" pauses that are not 
>>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>>> enabled the safepoint logging I saw the same result, long "sync" pause 
>>> times with short spin and block times, usually with the "RevokeBias" 
>>> description. We're seeing pause times sometimes in excess of 10 seconds, so 
>>> it's a pretty debilitating issue. Our machines are not swapping (or even 
>>> close to it) or having other load issues when these pauses occur. Any ideas 
>>> would be very appreciated. Thanks!
>> 
> 





Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
Actually - there is 

-XX:+SafepointTimeout

which will print out offending threads (assuming you reach a 10 second pause)…

That is probably your best bet.

> On Oct 24, 2014, at 2:38 PM, graham sanderson  wrote:
> 
> This certainly sounds like a JVM bug.
> 
> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
> and don’t seem to have seen this (note we are on 7u67, so that might be an 
> interesting data point, though since the old thread predated that probably 
> not)
> 
> 1) From the app/java side, I’d obviously see if you can identify anything 
> which always coincides with this - repair, compaction etc
> 2) From the VM side (given that this as Benedict mentioned) some threads are 
> taking a long time to rendezvous at the safe point, and it is probably not 
> application threads, I’d look what GC threads, compiler threads etc might be 
> doing. As mentioned it shouldn’t be anything to do with operations which run 
> at a safe point anyway (e.g. scavenge)
>   a) So look at what CMS is doing at the time and see if you can correlate
>   b) Check Oracle for related bugs - didn’t obviously see any, but there 
> have been some complaints related to compilation and safe points
>   c) Add any compilation tracing you can
>   d) Kind of important here - see if you can figure out via dtrace, 
> system tap, gdb or whatever, what the threads are doing when this happens. 
> Sadly it doesn’t look like you can figure out when this is happening (until 
> afterwards) unless you have access to a debug JVM build (and can turn on 
> -XX:+TraceSafepoint and look for a safe point start without a corresponding 
> update within a time period) - if you don’t have access to that, I guess you 
> could try and get a dump every 2-3 seconds (you should catch a 9 second pause 
> eventually!)
> 
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley wrote:
>> 
>> I'm also curious to know if this was ever resolved or if there's any other 
>> recommended steps to take to continue to track it down. I'm seeing the same 
>> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
>> 1.7u71, using the CMS collector. Just as described above, the issue is long 
>> "Total time for which application threads were stopped" pauses that are not 
>> a direct result of GC pauses (ParNew, initial mark or remark). When I 
>> enabled the safepoint logging I saw the same result, long "sync" pause times 
>> with short spin and block times, usually with the "RevokeBias" description. 
>> We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
>> debilitating issue. Our machines are not swapping (or even close to it) or 
>> having other load issues when these pauses occur. Any ideas would be very 
>> appreciated. Thanks!
> 





Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
This certainly sounds like a JVM bug.

We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
and don’t seem to have seen this (note we are on 7u67, so that might be an 
interesting data point, though since the old thread predated that probably not)

1) From the app/java side, I’d obviously see if you can identify anything which 
always coincides with this - repair, compaction etc
2) From the VM side (given that this as Benedict mentioned) some threads are 
taking a long time to rendezvous at the safe point, and it is probably not 
application threads, I’d look what GC threads, compiler threads etc might be 
doing. As mentioned it shouldn’t be anything to do with operations which run at 
a safe point anyway (e.g. scavenge)
a) So look at what CMS is doing at the time and see if you can correlate
b) Check Oracle for related bugs - didn’t obviously see any, but there 
have been some complaints related to compilation and safe points
c) Add any compilation tracing you can
d) Kind of important here - see if you can figure out via dtrace, 
system tap, gdb or whatever, what the threads are doing when this happens. 
Sadly it doesn’t look like you can figure out when this is happening (until 
afterwards) unless you have access to a debug JVM build (and can turn on 
-XX:+TraceSafepoint and look for a safe point start without a corresponding 
update within a time period) - if you don’t have access to that, I guess you 
could try and get a dump every 2-3 seconds (you should catch a 9 second pause 
eventually!)
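
For 2(d), a rough sketch of the sort of loop I mean - the PID file path, output 
file and the 2 second interval are just placeholders, adjust for your setup:

#!/bin/sh
# Rough sketch: sample thread dumps every couple of seconds so that a
# multi-second safepoint sync eventually gets caught mid-pause.
PID=$(cat /var/run/cassandra/cassandra.pid)   # placeholder path
while true; do
  date >> /tmp/jstack-samples.log
  # jstack -F goes via the serviceability agent, so it can still grab a dump
  # while the VM is stuck waiting for threads to reach the safe point
  jstack -F "$PID" >> /tmp/jstack-samples.log 2>&1
  sleep 2
done

Crude, but if the pauses are ~9 seconds you should land inside one fairly 
quickly.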

> On Oct 24, 2014, at 12:35 PM, Dan van Kley  wrote:
> 
> I'm also curious to know if this was ever resolved or if there's any other 
> recommended steps to take to continue to track it down. I'm seeing the same 
> issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
> 1.7u71, using the CMS collector. Just as described above, the issue is long 
> "Total time for which application threads were stopped" pauses that are not a 
> direct result of GC pauses (ParNew, initial mark or remark). When I enabled 
> the safepoint logging I saw the same result, long "sync" pause times with 
> short spin and block times, usually with the "RevokeBias" description. We're 
> seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
> debilitating issue. Our machines are not swapping (or even close to it) or 
> having other load issues when these pauses occur. Any ideas would be very 
> appreciated. Thanks!





Re: Intermittent long application pauses on nodes

2014-10-24 Thread Dan van Kley
I'm also curious to know if this was ever resolved or if there's any other
recommended steps to take to continue to track it down. I'm seeing the same
issue in our production cluster, which is running Cassandra 2.0.10 and JVM
1.7u71, using the CMS collector. Just as described above, the issue is long
"Total time for which application threads were stopped" pauses that are not
a direct result of GC pauses (ParNew, initial mark or remark). When I
enabled the safepoint logging I saw the same result, long "sync" pause
times with short spin and block times, usually with the "RevokeBias"
description. We're seeing pause times sometimes in excess of 10 seconds, so
it's a pretty debilitating issue. Our machines are not swapping (or even
close to it) or having other load issues when these pauses occur. Any ideas
would be very appreciated. Thanks!
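
For anyone digging through similar logs, a couple of one-liners along these 
lines will surface the offenders. Paths and thresholds here are only examples, 
and the second one assumes the default PrintSafepointStatistics column layout, 
where the third number in the second bracket pair is the sync time in 
milliseconds:

awk '/Total time for which application threads were stopped/ { if ($(NF-1) + 0 > 1.0) print }' /var/log/cassandra/gc.log

awk -F'[][]' 'NF >= 5 { split($4, t, " "); if (t[3] + 0 > 1000) print }' /var/log/cassandra/stdout.log

The first pulls GC-log stop lines longer than one second; the second pulls 
safepoint-statistics rows whose sync column exceeds one second.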


Re: Intermittent long application pauses on nodes

2014-04-14 Thread Ken Hancock
Searching my list archives shows this thread evaporated.  Was a root
cause ever found?  Very curious.




On Mon, Feb 3, 2014 at 11:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

> Hi Frank,
>
> The "9391" under RevokeBias is the number of milliseconds spent
> synchronising on the safepoint prior to the VM operation, i.e. the time it
> took to ensure all application threads were stopped. So this is the
> culprit. Notice that the time spent spinning/blocking for the threads we
> are supposed to be waiting on is very low; it looks to me that this is time
> spent waiting for CMS threads to yield, though it is very hard to say with
> absolute certainty. It doesn't look like the issue is actually the
> RevokeBias itself, anyway.
>
> I think we should take this off list. It definitely warrants a ticket,
> though I expect this will be difficult to pin down, so you will have to be
> willing to experiment a bit with us, but we would be very grateful for the
> help. If you can pin down and share a specific workload that triggers this
> we may be able to do it without you though!
>
> It's possible that this is a JVM issue, but if so there may be some
> remedial action we can take anyway. There are some more flags we should
> add, but we can discuss that once you open a ticket. If you could include
> the strange JMX error as well, that might be helpful.
>
> Thanks,
>
> Benedict
>
>
> On 3 February 2014 15:34, Frank Ng  wrote:
>
>> I was able to send SafePointStatistics to another log file via the
>> additional JVM flags and recently noticed a pause of 9.3936600 seconds.
>> Here are the log entries:
>>
>> GC Log file:
>> ---
>> 2014-01-31T07:49:14.755-0500: 137460.842: Total time for which
>> application threads were stopped: 0.1095540 seconds
>> 2014-01-31T07:51:01.870-0500: 137567.957: Total time for which
>> application threads were stopped: 9.3936600 seconds
>> 2014-01-31T07:51:02.537-0500: 137568.623: Total time for which
>> application threads were stopped: 0.1207440 seconds
>>
>> JVM Stdout Log File:
>> ---
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137460.734: GenCollectForAllocation   [ 421  0  0 ]  [  0  0    23  0  84 ]  0
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137558.562: RevokeBias                [ 462  2  9 ]  [ 13  0  9391  1   0 ]  0
>> <dependency_failed ... ctxk='javax/management/ObjectName$Property'
>>  witness='javax/management/ObjectName$PatternProperty' stamp='137568.503'/>
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137568.500: Deoptimize                [ 481  1  5 ]  [  0  0   118  0   1 ]  0
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137569.625: no vm operation           [ 483  0  1 ]  [  0  0    18  0   0 ]  0
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137571.641: no vm operation           [ 483  0  1 ]  [  0  0    42  1   0 ]  0
>>          vmop                     [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> 137575.703: no vm operation           [ 483  0  1 ]  [  0  0    25  1   0 ]  0
>>
>> If SafepointStatistics are printed before the Application Stop times,
>> then it seems that the RevokeBias was the cause of the pause.
>> If SafepointStatistics are printed after the Application Stop times, then
>> it seems that the Deoptimize was the cause of the pause.
>> In addition, I see a strange dependency failed error relating to JMX in
>> the JVM stdout log file.
>>
>> thanks
>>
>>
>> On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith <
>> belliottsm...@datastax.com> wrote:
>>
>>> Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
>>> -XX:+LogVMOutput
>>>
>>> I never figured out what kills stdout for C*. It's a library we depend
>>> on, didn't try too hard to figure out which one.
>>>
>>>
>>> On 29 January 2014 21:07, Frank Ng  wrote:
>>>
>>>> Benedict,
>>>> Thanks for the advice.  I've tried turning on
>>>> PrintSafepointStatistics.  However, that info is only sent to the STDOUT
>>>> console.  The cassandra startup script closes the STDOUT when it finishes,
>>>> so nothing is shown for safepoint statistics once it's done starting up.
>>>> Do you know how to start up cassandra and send all output to a log file?

Re: Intermittent long application pauses on nodes

2014-02-27 Thread Benedict Elliott Smith
>>>>>>>>> Max: 222346218.3, Diff: 0.0]
>>>>>>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff:
>>>>>>>>> 1.7, Sum: 13.7]
>>>>>>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7,
>>>>>>>>> Sum: 42.6]
>>>>>>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22,
>>>>>>>>> Sum: 120]
>>>>>>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1,
>>>>>>>>> Sum: 46.5]
>>>>>>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff:
>>>>>>>>> 0.1, Sum: 224.6]
>>>>>>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0,
>>>>>>>>> Sum: 0.1]
>>>>>>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff:
>>>>>>>>> 0.0, Sum: 0.1]
>>>>>>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8,
>>>>>>>>> Diff: 0.0, Sum: 327.6]
>>>>>>>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>>>>>>>> 222346382.1, Diff: 0.0]
>>>>>>>>>   [Code Root Fixup: 0.0 ms]
>>>>>>>>>   [Clear CT: 0.4 ms]
>>>>>>>>>   [Other: 2.1 ms]
>>>>>>>>>  [Choose CSet: 0.0 ms]
>>>>>>>>>  [Ref Proc: 1.1 ms]
>>>>>>>>>  [Ref Enq: 0.0 ms]
>>>>>>>>>  [Free CSet: 0.4 ms]
>>>>>>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M
>>>>>>>>> Heap: 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>>>>>>> Heap after GC invocations=4074 (full 1):
>>>>>>>>> garbage-first heap   total 8388608K, used 3090914K 
>>>>>>>>> [0x0005f5c0,
>>>>>>>>> 0x0007f5c0, 0x0007f5c0)
>>>>>>>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>>>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>>>>>> 0x0007f780, 0x0008)
>>>>>>>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>>>>>>>> 0x0007f76c9200, 0x0007f780)
>>>>>>>>> No shared spaces configured.
>>>>>>>>> }
>>>>>>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>>>>>>> 222346.219: G1IncCollectionPause [ 111  0
>>>>>>>>>  0]  [ 0 0 0 0 27586]  0
>>>>>>>>>
>>>>>>>>> And the total thime for which application threads were stopped is
>>>>>>>>> 27.58 seconds.
>>>>>>>>>
>>>>>>>>> CMS behaves in a similar manner. We thought it would be GC,
>>>>>>>>> waiting for mmaped files being read from disk (the thread cannot reach
>>>>>>>>> safepoint during this operation), but it doesn't explain the huge 
>>>>>>>>> time.
>>>>>>>>>
>>>>>>>>> We'll try jhiccup to see if it provides any additional
>>>>>>>>> information. The test was done on mixed aws/openstack environment,
>>>>>>>>> openjdk 1.7.0_45, cassandra 1.2.11. Upgrading to 2.0.x is no option 
>>>>>>>>> for us.
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>>
>>>>>>>>> ondrej cernos
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I have not had a chance to file a JIRA ticket.  We have
>>>>>>>>>> not been able to resolve the issue.  But since Joel mentioned that
>>>>>>>>>> upgrading to Cassandra 2.0.X solved it for them, we may need to 
>>>>>>>>>> upgrade.
>>>>>>>>>> We are currently on Java 1.7 and Cassandra 1.2.8
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright <
>>>>>>>>>> kwri...@nanigans.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> You’re running 2.0.* in production?  May I ask what C* version
>>>>>>>>>>> and OS?  Any hardware details would be appreciated as well.  Thx!
>>>>>>>>>>>
>>>>>>>>>>> From: Joel Samuelsson 
>>>>>>>>>>> Reply-To: "user@cassandra.apache.org" >>>>>>>>>> >
>>>>>>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>>>>>>>
>>>>>>>>>>> To: "user@cassandra.apache.org" 
>>>>>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>>>>>
>>>>>>>>>>> We have had similar issues and upgrading C* to 2.0.x and Java to
>>>>>>>>>>> 1.7 seems to have helped our issues.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-02-13 Keith Wright :
>>>>>>>>>>>
>>>>>>>>>>>> Frank did you ever file a ticket for this issue or find the
>>>>>>>>>>>> root cause?  I believe we are seeing the same issues when 
>>>>>>>>>>>> attempting to
>>>>>>>>>>>> bootstrap.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> From: Robert Coli 
>>>>>>>>>>>> Reply-To: "user@cassandra.apache.org" <
>>>>>>>>>>>> user@cassandra.apache.org>
>>>>>>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>>>>>>> To: "user@cassandra.apache.org" 
>>>>>>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>>>>>>>>> belliottsm...@datastax.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's possible that this is a JVM issue, but if so there may be
>>>>>>>>>>>>> some remedial action we can take anyway. There are some more 
>>>>>>>>>>>>> flags we
>>>>>>>>>>>>> should add, but we can discuss that once you open a ticket. If 
>>>>>>>>>>>>> you could
>>>>>>>>>>>>> include the strange JMX error as well, that might be helpful.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> It would be appreciated if you could inform this thread of the
>>>>>>>>>>>> JIRA ticket number, for the benefit of the community and google 
>>>>>>>>>>>> searchers.
>>>>>>>>>>>> :)
>>>>>>>>>>>>
>>>>>>>>>>>> =Rob
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-02-27 Thread Frank Ng
>> switched to G1 out of desperation - and to try different code paths - not
>>>>>> that we'd thought it was a great idea. So I think we were hit by the
>>>>>> problem discussed in this thread, just the G1 report wasn't very clear,
>>>>>> sorry.
>>>>>>
>>>>>> regards,
>>>>>> ondrej
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 17, 2014 at 11:45 AM, Benedict Elliott Smith <
>>>>>> belliottsm...@datastax.com> wrote:
>>>>>>
>>>>>>> Ondrej,
>>>>>>>
>>>>>>> It seems like your issue is much less difficult to diagnose: your
>>>>>>> collection times are long. At least, the pause you printed the time for 
>>>>>>> is
>>>>>>> all attributable to the G1 pause.
>>>>>>>
>>>>>>> Note that G1 has not generally performed well with Cassandra in our
>>>>>>> testing. There are a number of changes going in soon that may change 
>>>>>>> that,
>>>>>>> but for the time being it is advisable to stick with CMS. With tuning 
>>>>>>> you
>>>>>>> can no doubt bring your pauses down considerably.
>>>>>>>
>>>>>>>
>>>>>>> On 17 February 2014 10:17, Ondřej Černoš  wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> we are seeing the same kind of long pauses in Cassandra. We tried
>>>>>>>> to switch CMS to G1 without positive result. The stress test is read 
>>>>>>>> heavy,
>>>>>>>> 2 datacenters, 6 nodes, 400reqs/sec on one datacenter. We see spikes in
>>>>>>>> latency on 99.99 percentil and higher, caused by threads being stopped 
>>>>>>>> in
>>>>>>>> JVM.
>>>>>>>>
>>>>>>>> The GC in G1 looks like this:
>>>>>>>>
>>>>>>>> {Heap before GC invocations=4073 (full 1):
>>>>>>>> garbage-first heap   total 8388608K, used 3602914K
>>>>>>>> [0x0005f5c0, 0x0007f5c0, 0x0007f5c0)
>>>>>>>>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>>>>> 0x0007f780, 0x0008)
>>>>>>>>   the space 28672K,  95% used [0x0007f5c0,
>>>>>>>> 0x0007f76c9108, 0x0007f76c9200, 0x0007f780)
>>>>>>>> No shared spaces configured.
>>>>>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation
>>>>>>>> Pause) (young)
>>>>>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>>>>>> - age   1:   17213632 bytes,   17213632 total
>>>>>>>> - age   2:   19391208 bytes,   36604840 total
>>>>>>>> , 0.1664300 secs]
>>>>>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>>>>>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3,
>>>>>>>> Max: 222346218.3, Diff: 0.0]
>>>>>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff:
>>>>>>>> 1.7, Sum: 13.7]
>>>>>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7,
>>>>>>>> Sum: 42.6]
>>>>>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22,
>>>>>>>> Sum: 120]
>>>>>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1,
>>>>>>>> Sum: 46.5]
>>>>>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff:
>>>>>>>> 0.1, Sum: 224.6]
>>>>>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0,
>>>>>>>> Sum: 0.1]
>>>>>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff:
>>>>>>>> 0.0, Sum: 0.1]
>>>>>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8,
>>>>>>>> Diff: 0.0, Sum: 327.6]
>>&

Re: Intermittent long application pauses on nodes

2014-02-27 Thread Benedict Elliott Smith
t;>>> Ondrej,
>>>>>>
>>>>>> It seems like your issue is much less difficult to diagnose: your
>>>>>> collection times are long. At least, the pause you printed the time for 
>>>>>> is
>>>>>> all attributable to the G1 pause.
>>>>>>
>>>>>> Note that G1 has not generally performed well with Cassandra in our
>>>>>> testing. There are a number of changes going in soon that may change 
>>>>>> that,
>>>>>> but for the time being it is advisable to stick with CMS. With tuning you
>>>>>> can no doubt bring your pauses down considerably.
>>>>>>
>>>>>>
>>>>>> On 17 February 2014 10:17, Ondřej Černoš  wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> we are seeing the same kind of long pauses in Cassandra. We tried to
>>>>>>> switch CMS to G1 without positive result. The stress test is read 
>>>>>>> heavy, 2
>>>>>>> datacenters, 6 nodes, 400reqs/sec on one datacenter. We see spikes in
>>>>>>> latency on 99.99 percentil and higher, caused by threads being stopped 
>>>>>>> in
>>>>>>> JVM.
>>>>>>>
>>>>>>> The GC in G1 looks like this:
>>>>>>>
>>>>>>> {Heap before GC invocations=4073 (full 1):
>>>>>>> garbage-first heap   total 8388608K, used 3602914K
>>>>>>> [0x0005f5c0, 0x0007f5c0, 0x0007f5c0)
>>>>>>>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>>>> 0x0007f780, 0x0008)
>>>>>>>   the space 28672K,  95% used [0x0007f5c0,
>>>>>>> 0x0007f76c9108, 0x0007f76c9200, 0x0007f780)
>>>>>>> No shared spaces configured.
>>>>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation
>>>>>>> Pause) (young)
>>>>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>>>>> - age   1:   17213632 bytes,   17213632 total
>>>>>>> - age   2:   19391208 bytes,   36604840 total
>>>>>>> , 0.1664300 secs]
>>>>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>>>>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
>>>>>>> 222346218.3, Diff: 0.0]
>>>>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff:
>>>>>>> 1.7, Sum: 13.7]
>>>>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7,
>>>>>>> Sum: 42.6]
>>>>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22,
>>>>>>> Sum: 120]
>>>>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum:
>>>>>>> 46.5]
>>>>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff:
>>>>>>> 0.1, Sum: 224.6]
>>>>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0,
>>>>>>> Sum: 0.1]
>>>>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>>>>> Sum: 0.1]
>>>>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8,
>>>>>>> Diff: 0.0, Sum: 327.6]
>>>>>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>>>>>> 222346382.1, Diff: 0.0]
>>>>>>>   [Code Root Fixup: 0.0 ms]
>>>>>>>   [Clear CT: 0.4 ms]
>>>>>>>   [Other: 2.1 ms]
>>>>>>>  [Choose CSet: 0.0 ms]
>>>>>>>  [Ref Proc: 1.1 ms]
>>>>>>>      [Ref Enq: 0.0 ms]
>>>>>>>  [Free CSet: 0.4 ms]
>>>>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
>>>>>>> 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>>>>> Heap after GC invocations=4074 (full 1):
>>>>>>> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
>>>>>>> 0x0007f5c0, 0x0007f5c0)
>

Re: Intermittent long application pauses on nodes

2014-02-27 Thread Frank Ng
>>>>> it is advisable to stick with CMS. With tuning you
>>>>> can no doubt bring your pauses down considerably.
>>>>>
>>>>>
>>>>> On 17 February 2014 10:17, Ondřej Černoš  wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we are seeing the same kind of long pauses in Cassandra. We tried to
>>>>>> switch CMS to G1 without positive result. The stress test is read heavy, 
>>>>>> 2
>>>>>> datacenters, 6 nodes, 400reqs/sec on one datacenter. We see spikes in
>>>>>> latency on 99.99 percentil and higher, caused by threads being stopped in
>>>>>> JVM.
>>>>>>
>>>>>> The GC in G1 looks like this:
>>>>>>
>>>>>> {Heap before GC invocations=4073 (full 1):
>>>>>> garbage-first heap   total 8388608K, used 3602914K
>>>>>> [0x0005f5c0, 0x0007f5c0, 0x0007f5c0)
>>>>>>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>>> 0x0007f780, 0x0008)
>>>>>>   the space 28672K,  95% used [0x0007f5c0,
>>>>>> 0x0007f76c9108, 0x0007f76c9200, 0x0007f780)
>>>>>> No shared spaces configured.
>>>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation
>>>>>> Pause) (young)
>>>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>>>> - age   1:   17213632 bytes,   17213632 total
>>>>>> - age   2:   19391208 bytes,   36604840 total
>>>>>> , 0.1664300 secs]
>>>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>>>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
>>>>>> 222346218.3, Diff: 0.0]
>>>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff:
>>>>>> 1.7, Sum: 13.7]
>>>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7,
>>>>>> Sum: 42.6]
>>>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22,
>>>>>> Sum: 120]
>>>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum:
>>>>>> 46.5]
>>>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff:
>>>>>> 0.1, Sum: 224.6]
>>>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum:
>>>>>> 0.1]
>>>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>>>> Sum: 0.1]
>>>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff:
>>>>>> 0.0, Sum: 327.6]
>>>>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>>>>> 222346382.1, Diff: 0.0]
>>>>>>   [Code Root Fixup: 0.0 ms]
>>>>>>   [Clear CT: 0.4 ms]
>>>>>>   [Other: 2.1 ms]
>>>>>>  [Choose CSet: 0.0 ms]
>>>>>>  [Ref Proc: 1.1 ms]
>>>>>>  [Ref Enq: 0.0 ms]
>>>>>>  [Free CSet: 0.4 ms]
>>>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
>>>>>> 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>>>> Heap after GC invocations=4074 (full 1):
>>>>>> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
>>>>>> 0x0007f5c0, 0x0007f5c0)
>>>>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>>> 0x0007f780, 0x0008)
>>>>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>>>>> 0x0007f76c9200, 0x0007f780)
>>>>>> No shared spaces configured.
>>>>>> }
>>>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>>>> 222346.219: G1IncCollectionPause [ 111  0
>>>>>>  0    ]  [     0     0 0 0 27586]  0
>>>>>>
>>>>>> And the total thime for which application threads were stopped is
>>>>>> 27.58 seconds.
>>>>>>
>>>>

Re: Intermittent long application pauses on nodes

2014-02-21 Thread Joel Samuelsson
>>>>>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>> 0x0007f780, 0x0008)
>>>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>>>> 0x0007f76c9200, 0x0007f780)
>>>>> No shared spaces configured.
>>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation
>>>>> Pause) (young)
>>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>>> - age   1:   17213632 bytes,   17213632 total
>>>>> - age   2:   19391208 bytes,   36604840 total
>>>>> , 0.1664300 secs]
>>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
>>>>> 222346218.3, Diff: 0.0]
>>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7,
>>>>> Sum: 13.7]
>>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum:
>>>>> 42.6]
>>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22,
>>>>> Sum: 120]
>>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum:
>>>>> 46.5]
>>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1,
>>>>> Sum: 224.6]
>>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum:
>>>>> 0.1]
>>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>>> Sum: 0.1]
>>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff:
>>>>> 0.0, Sum: 327.6]
>>>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>>>> 222346382.1, Diff: 0.0]
>>>>>   [Code Root Fixup: 0.0 ms]
>>>>>   [Clear CT: 0.4 ms]
>>>>>   [Other: 2.1 ms]
>>>>>  [Choose CSet: 0.0 ms]
>>>>>  [Ref Proc: 1.1 ms]
>>>>>  [Ref Enq: 0.0 ms]
>>>>>  [Free CSet: 0.4 ms]
>>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
>>>>> 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>>> Heap after GC invocations=4074 (full 1):
>>>>> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
>>>>> 0x0007f5c0, 0x0007f5c0)
>>>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>>> 0x0007f780, 0x0008)
>>>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>>>> 0x0007f76c9200, 0x0007f780)
>>>>> No shared spaces configured.
>>>>> }
>>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>>> 222346.219: G1IncCollectionPause [ 111  0
>>>>>  0]  [ 0 0 0 0 27586]  0
>>>>>
>>>>> And the total thime for which application threads were stopped is
>>>>> 27.58 seconds.
>>>>>
>>>>> CMS behaves in a similar manner. We thought it would be GC, waiting
>>>>> for mmaped files being read from disk (the thread cannot reach safepoint
>>>>> during this operation), but it doesn't explain the huge time.
>>>>>
>>>>> We'll try jhiccup to see if it provides any additional information.
>>>>> The test was done on mixed aws/openstack environment, openjdk 1.7.0_45,
>>>>> cassandra 1.2.11. Upgrading to 2.0.x is no option for us.
>>>>>
>>>>> regards,
>>>>>
>>>>> ondrej cernos
>>>>>
>>>>>
>>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:
>>>>>
>>>>>> Sorry, I have not had a chance to file a JIRA ticket.  We have not
>>>>>> been able to resolve the issue.  But since Joel mentioned that upgrading 
>>>>>> to
>>>>>> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>>>>>> currently on Java 1.7 and Cassandra 1.2.8
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright 
>>>>>> wrote:
>>>>>>
>>>>>>> You’re running 2.0.* in production?  May I ask what C* version and
>>>>>>> OS?  Any hardware details would be appreciated as well.  Thx!
>>>>>>>
>>>>>>> From: Joel Samuelsson 
>>>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>>>
>>>>>>> To: "user@cassandra.apache.org" 
>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>
>>>>>>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7
>>>>>>> seems to have helped our issues.
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-13 Keith Wright :
>>>>>>>
>>>>>>>> Frank did you ever file a ticket for this issue or find the root
>>>>>>>> cause?  I believe we are seeing the same issues when attempting to
>>>>>>>> bootstrap.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> From: Robert Coli 
>>>>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>>> To: "user@cassandra.apache.org" 
>>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>>
>>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>>>>> belliottsm...@datastax.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's possible that this is a JVM issue, but if so there may be
>>>>>>>>> some remedial action we can take anyway. There are some more flags we
>>>>>>>>> should add, but we can discuss that once you open a ticket. If you 
>>>>>>>>> could
>>>>>>>>> include the strange JMX error as well, that might be helpful.
>>>>>>>>>
>>>>>>>>
>>>>>>>> It would be appreciated if you could inform this thread of the JIRA
>>>>>>>> ticket number, for the benefit of the community and google searchers. 
>>>>>>>> :)
>>>>>>>>
>>>>>>>> =Rob
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-02-20 Thread Joel Samuelsson
>>>> - age   1:   17213632 bytes,   17213632 total
>>>> - age   2:   19391208 bytes,   36604840 total
>>>> , 0.1664300 secs]
>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
>>>> 222346218.3, Diff: 0.0]
>>>>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7,
>>>> Sum: 13.7]
>>>>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum:
>>>> 42.6]
>>>> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum:
>>>> 120]
>>>>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum:
>>>> 46.5]
>>>>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1,
>>>> Sum: 224.6]
>>>>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum:
>>>> 0.1]
>>>>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>> Sum: 0.1]
>>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff:
>>>> 0.0, Sum: 327.6]
>>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>>> 222346382.1, Diff: 0.0]
>>>>   [Code Root Fixup: 0.0 ms]
>>>>   [Clear CT: 0.4 ms]
>>>>   [Other: 2.1 ms]
>>>>  [Choose CSet: 0.0 ms]
>>>>  [Ref Proc: 1.1 ms]
>>>>  [Ref Enq: 0.0 ms]
>>>>  [Free CSet: 0.4 ms]
>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
>>>> 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>> Heap after GC invocations=4074 (full 1):
>>>> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
>>>> 0x0007f5c0, 0x0007f5c0)
>>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>>> 0x0007f780, 0x0008)
>>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>>> 0x0007f76c9200, 0x0007f780)
>>>> No shared spaces configured.
>>>> }
>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>> 222346.219: G1IncCollectionPause [ 111  0
>>>>0]  [ 0 0 0 0 27586]  0
>>>>
>>>> And the total thime for which application threads were stopped is 27.58
>>>> seconds.
>>>>
>>>> CMS behaves in a similar manner. We thought it would be GC, waiting for
>>>> mmaped files being read from disk (the thread cannot reach safepoint during
>>>> this operation), but it doesn't explain the huge time.
>>>>
>>>> We'll try jhiccup to see if it provides any additional information. The
>>>> test was done on mixed aws/openstack environment, openjdk 1.7.0_45,
>>>> cassandra 1.2.11. Upgrading to 2.0.x is no option for us.
>>>>
>>>> regards,
>>>>
>>>> ondrej cernos
>>>>
>>>>
>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:
>>>>
>>>>> Sorry, I have not had a chance to file a JIRA ticket.  We have not
>>>>> been able to resolve the issue.  But since Joel mentioned that upgrading 
>>>>> to
>>>>> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>>>>> currently on Java 1.7 and Cassandra 1.2.8
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright 
>>>>> wrote:
>>>>>
>>>>>> You’re running 2.0.* in production?  May I ask what C* version and
>>>>>> OS?  Any hardware details would be appreciated as well.  Thx!
>>>>>>
>>>>>> From: Joel Samuelsson 
>>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>>
>>>>>> To: "user@cassandra.apache.org" 
>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>
>>>>>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7
>>>>>> seems to have helped our issues.
>>>>>>
>>>>>>
>>>>>> 2014-02-13 Keith Wright :
>>>>>>
>>>>>>> Frank did you ever file a ticket for this issue or find the root
>>>>>>> cause?  I believe we are seeing the same issues when attempting to
>>>>>>> bootstrap.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> From: Robert Coli 
>>>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>> To: "user@cassandra.apache.org" 
>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>>>> belliottsm...@datastax.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> It's possible that this is a JVM issue, but if so there may be some
>>>>>>>> remedial action we can take anyway. There are some more flags we should
>>>>>>>> add, but we can discuss that once you open a ticket. If you could 
>>>>>>>> include
>>>>>>>> the strange JMX error as well, that might be helpful.
>>>>>>>>
>>>>>>>
>>>>>>> It would be appreciated if you could inform this thread of the JIRA
>>>>>>> ticket number, for the benefit of the community and google searchers. :)
>>>>>>>
>>>>>>> =Rob
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-02-17 Thread Benedict Elliott Smith
>>> Sum: 0.1]
>>>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff:
>>> 0.0, Sum: 327.6]
>>>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
>>> 222346382.1, Diff: 0.0]
>>>   [Code Root Fixup: 0.0 ms]
>>>   [Clear CT: 0.4 ms]
>>>   [Other: 2.1 ms]
>>>  [Choose CSet: 0.0 ms]
>>>  [Ref Proc: 1.1 ms]
>>>  [Ref Enq: 0.0 ms]
>>>  [Free CSet: 0.4 ms]
>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
>>> 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>> Heap after GC invocations=4074 (full 1):
>>> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
>>> 0x0007f5c0, 0x0007f5c0)
>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
>>> 0x0007f780, 0x0008)
>>>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
>>> 0x0007f76c9200, 0x0007f780)
>>> No shared spaces configured.
>>> }
>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>> 222346.219: G1IncCollectionPause [ 111  0
>>>0]  [ 0 0 0 0 27586]  0
>>>
>>> And the total thime for which application threads were stopped is 27.58
>>> seconds.
>>>
>>> CMS behaves in a similar manner. We thought it would be GC, waiting for
>>> mmaped files being read from disk (the thread cannot reach safepoint during
>>> this operation), but it doesn't explain the huge time.
>>>
>>> We'll try jhiccup to see if it provides any additional information. The
>>> test was done on mixed aws/openstack environment, openjdk 1.7.0_45,
>>> cassandra 1.2.11. Upgrading to 2.0.x is no option for us.
>>>
>>> regards,
>>>
>>> ondrej cernos
>>>
>>>
>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:
>>>
>>>> Sorry, I have not had a chance to file a JIRA ticket.  We have not been
>>>> able to resolve the issue.  But since Joel mentioned that upgrading to
>>>> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>>>> currently on Java 1.7 and Cassandra 1.2.8
>>>>
>>>>
>>>>
>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wrote:
>>>>
>>>>> You’re running 2.0.* in production?  May I ask what C* version and OS?
>>>>>  Any hardware details would be appreciated as well.  Thx!
>>>>>
>>>>> From: Joel Samuelsson 
>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>
>>>>> To: "user@cassandra.apache.org" 
>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>
>>>>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7
>>>>> seems to have helped our issues.
>>>>>
>>>>>
>>>>> 2014-02-13 Keith Wright :
>>>>>
>>>>>> Frank did you ever file a ticket for this issue or find the root
>>>>>> cause?  I believe we are seeing the same issues when attempting to
>>>>>> bootstrap.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> From: Robert Coli 
>>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>> To: "user@cassandra.apache.org" 
>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>
>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>>> belliottsm...@datastax.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> It's possible that this is a JVM issue, but if so there may be some
>>>>>>> remedial action we can take anyway. There are some more flags we should
>>>>>>> add, but we can discuss that once you open a ticket. If you could 
>>>>>>> include
>>>>>>> the strange JMX error as well, that might be helpful.
>>>>>>>
>>>>>>
>>>>>> It would be appreciated if you could inform this thread of the JIRA
>>>>>> ticket number, for the benefit of the community and google searchers. :)
>>>>>>
>>>>>> =Rob
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-02-17 Thread Ondřej Černoš
>> And the total time for which application threads were stopped is 27.58
>> seconds.
>>
>> CMS behaves in a similar manner. We thought it would be GC, waiting for
>> mmaped files being read from disk (the thread cannot reach safepoint during
>> this operation), but it doesn't explain the huge time.
>>
>> We'll try jhiccup to see if it provides any additional information. The
>> test was done on mixed aws/openstack environment, openjdk 1.7.0_45,
>> cassandra 1.2.11. Upgrading to 2.0.x is no option for us.
>>
>> regards,
>>
>> ondrej cernos
>>
>>
>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:
>>
>>> Sorry, I have not had a chance to file a JIRA ticket.  We have not been
>>> able to resolve the issue.  But since Joel mentioned that upgrading to
>>> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>>> currently on Java 1.7 and Cassandra 1.2.8
>>>
>>>
>>>
>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wrote:
>>>
>>>> You’re running 2.0.* in production?  May I ask what C* version and OS?
>>>>  Any hardware details would be appreciated as well.  Thx!
>>>>
>>>> From: Joel Samuelsson 
>>>> Reply-To: "user@cassandra.apache.org" 
>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>
>>>> To: "user@cassandra.apache.org" 
>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>
>>>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7
>>>> seems to have helped our issues.
>>>>
>>>>
>>>> 2014-02-13 Keith Wright :
>>>>
>>>>> Frank did you ever file a ticket for this issue or find the root
>>>>> cause?  I believe we are seeing the same issues when attempting to
>>>>> bootstrap.
>>>>>
>>>>> Thanks
>>>>>
>>>>> From: Robert Coli 
>>>>> Reply-To: "user@cassandra.apache.org" 
>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>> To: "user@cassandra.apache.org" 
>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>
>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>> belliottsm...@datastax.com> wrote:
>>>>>
>>>>>>
>>>>>> It's possible that this is a JVM issue, but if so there may be some
>>>>>> remedial action we can take anyway. There are some more flags we should
>>>>>> add, but we can discuss that once you open a ticket. If you could include
>>>>>> the strange JMX error as well, that might be helpful.
>>>>>>
>>>>>
>>>>> It would be appreciated if you could inform this thread of the JIRA
>>>>> ticket number, for the benefit of the community and google searchers. :)
>>>>>
>>>>> =Rob
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-02-17 Thread Benedict Elliott Smith
Ondrej,

It seems like your issue is much less difficult to diagnose: your
collection times are long. At least, the pause you printed the time for is
all attributable to the G1 pause.

Note that G1 has not generally performed well with Cassandra in our
testing. There are a number of changes going in soon that may change that,
but for the time being it is advisable to stick with CMS. With tuning you
can no doubt bring your pauses down considerably.
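For context, the stock cassandra-env.sh of that era configures CMS roughly as sketched below; the heap and new-gen sizes are illustrative placeholders, and those (together with MaxTenuringThreshold) are usually the first knobs turned when tuning pause times down:

MAX_HEAP_SIZE="8G"       # illustrative only; size to the node
HEAP_NEWSIZE="800M"      # illustrative only; new-gen size strongly affects young-pause length
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"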


On 17 February 2014 10:17, Ondřej Černoš  wrote:

> Hi all,
>
> we are seeing the same kind of long pauses in Cassandra. We tried switching
> from CMS to G1 without any improvement. The stress test is read heavy: 2
> datacenters, 6 nodes, 400 reqs/sec on one datacenter. We see spikes in
> latency at the 99.99th percentile and higher, caused by threads being stopped
> in the JVM.
>
> The GC in G1 looks like this:
>
> {Heap before GC invocations=4073 (full 1):
> garbage-first heap   total 8388608K, used 3602914K [0x0005f5c0,
> 0x0007f5c0, 0x0007f5c0)
>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
> 0x0007f780, 0x0008)
>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
> 0x0007f76c9200, 0x0007f780)
> No shared spaces configured.
> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause)
> (young)
> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
> - age   1:   17213632 bytes,   17213632 total
> - age   2:   19391208 bytes,   36604840 total
> , 0.1664300 secs]
>   [Parallel Time: 163.9 ms, GC Workers: 2]
>  [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
> 222346218.3, Diff: 0.0]
>  [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7,
> Sum: 13.7]
>  [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum:
> 42.6]
> [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum:
> 120]
>  [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
>  [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1,
> Sum: 224.6]
>  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
>  [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum:
> 0.1]
>  [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0,
> Sum: 327.6]
>  [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
> 222346382.1, Diff: 0.0]
>   [Code Root Fixup: 0.0 ms]
>   [Clear CT: 0.4 ms]
>   [Other: 2.1 ms]
>  [Choose CSet: 0.0 ms]
>  [Ref Proc: 1.1 ms]
>  [Ref Enq: 0.0 ms]
>  [Free CSet: 0.4 ms]
>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
> 3518.5M(8192.0M)->3018.5M(8192.0M)]
> Heap after GC invocations=4074 (full 1):
> garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
> 0x0007f5c0, 0x0007f5c0)
>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
> compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
> 0x0007f780, 0x0008)
>   the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
> 0x0007f76c9200, 0x0007f780)
> No shared spaces configured.
> }
> [Times: user=0.35 sys=0.00, real=27.58 secs]
> 222346.219: G1IncCollectionPause [ 111  0
>  0]  [ 0 0 0 0 27586]  0
>
> And the total time for which application threads were stopped is 27.58
> seconds.
>
> CMS behaves in a similar manner. We thought it might be GC waiting for
> mmapped files to be read from disk (a thread cannot reach a safepoint during
> that operation), but that doesn't explain such a huge pause.
>
> We'll try jhiccup to see if it provides any additional information. The
> test was done on a mixed aws/openstack environment, openjdk 1.7.0_45,
> cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.
>
> regards,
>
> ondrej cernos
>
>
> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:
>
>> Sorry, I have not had a chance to file a JIRA ticket.  We have not been
>> able to resolve the issue.  But since Joel mentioned that upgrading to
>> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>> currently on Java 1.7 and Cassandra 1.2.8
>>
>>
>>
>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wrote:
>>
>>> You’re running 2.0.* in production?  May I ask what C* version and OS?
>>>  Any hardware details would be appreciated as well.  Thx!
>>>
>>> From: Joel Samuelsson 
>>> Reply-To: "user@cassandra.apache.org" 
>>> Date: Thu

Re: Intermittent long application pauses on nodes

2014-02-17 Thread Ondřej Černoš
Hi all,

we are seeing the same kind of long pauses in Cassandra. We tried switching
from CMS to G1 without any improvement. The stress test is read heavy: 2
datacenters, 6 nodes, 400 reqs/sec on one datacenter. We see spikes in latency
at the 99.99th percentile and higher, caused by threads being stopped in the
JVM.

The GC in G1 looks like this:

{Heap before GC invocations=4073 (full 1):
garbage-first heap   total 8388608K, used 3602914K [0x0005f5c0,
0x0007f5c0, 0x0007f5c0)
 region size 4096K, 142 young (581632K), 11 survivors (45056K)
compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
0x0007f780, 0x0008)
  the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
0x0007f76c9200, 0x0007f780)
No shared spaces configured.
2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 37748736 bytes, new threshold 15 (max 15)
- age   1:   17213632 bytes,   17213632 total
- age   2:   19391208 bytes,   36604840 total
, 0.1664300 secs]
  [Parallel Time: 163.9 ms, GC Workers: 2]
 [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max:
222346218.3, Diff: 0.0]
 [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, Sum:
13.7]
 [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum: 42.6]
[Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: 120]
 [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
 [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, Sum:
224.6]
 [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
 [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum:
0.1]
 [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0,
Sum: 327.6]
 [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max:
222346382.1, Diff: 0.0]
  [Code Root Fixup: 0.0 ms]
  [Clear CT: 0.4 ms]
  [Other: 2.1 ms]
 [Choose CSet: 0.0 ms]
 [Ref Proc: 1.1 ms]
 [Ref Enq: 0.0 ms]
 [Free CSet: 0.4 ms]
  [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap:
3518.5M(8192.0M)->3018.5M(8192.0M)]
Heap after GC invocations=4074 (full 1):
garbage-first heap   total 8388608K, used 3090914K [0x0005f5c0,
0x0007f5c0, 0x0007f5c0)
 region size 4096K, 17 young (69632K), 17 survivors (69632K)
compacting perm gen  total 28672K, used 27428K [0x0007f5c0,
0x0007f780, 0x0008)
  the space 28672K,  95% used [0x0007f5c0, 0x0007f76c9108,
0x0007f76c9200, 0x0007f780)
No shared spaces configured.
}
[Times: user=0.35 sys=0.00, real=27.58 secs]
222346.219: G1IncCollectionPause [ 111  0
   0]  [ 0 0 0 0 27586]  0

And the total time for which application threads were stopped is 27.58
seconds.

CMS behaves in a similar manner. We thought it might be GC waiting for
mmapped files to be read from disk (a thread cannot reach a safepoint during
that operation), but that doesn't explain such a huge pause.

We'll try jhiccup to see if it provides any additional information. The
test was done on a mixed aws/openstack environment, openjdk 1.7.0_45,
cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.
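As a rough sketch of what attaching jHiccup usually looks like, it can be loaded into the same JVM as a javaagent (the jar path below is hypothetical; running it as a wrapper script is the other common route):

JVM_OPTS="$JVM_OPTS -javaagent:/opt/jhiccup/jHiccup.jar"   # hypothetical path; see the jHiccup README for options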

regards,

ondrej cernos


On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng  wrote:

> Sorry, I have not had a chance to file a JIRA ticket.  We have not been
> able to resolve the issue.  But since Joel mentioned that upgrading to
> Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
> currently on Java 1.7 and Cassandra 1.2.8
>
>
>
> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wrote:
>
>> You're running 2.0.* in production?  May I ask what C* version and OS?
>>  Any hardware details would be appreciated as well.  Thx!
>>
>> From: Joel Samuelsson 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Thursday, February 13, 2014 at 11:39 AM
>>
>> To: "user@cassandra.apache.org" 
>> Subject: Re: Intermittent long application pauses on nodes
>>
>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7
>> seems to have helped our issues.
>>
>>
>> 2014-02-13 Keith Wright :
>>
>>> Frank did you ever file a ticket for this issue or find the root cause?
>>>  I believe we are seeing the same issues when attempting to bootstrap.
>>>
>>> Thanks
>>>
>>> From: Robert Coli 
>>> Reply-To: "user@cassandra.apache.org" 
>>> Date: Monday, February 3, 2014 at 6:10 PM
>>> To: "user@cassandra.apache.org" 
>>> Subject: Re: Intermittent long application pauses on nodes
>>>
>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>> belliottsm...@datastax.

Re: Intermittent long application pauses on nodes

2014-02-14 Thread Frank Ng
Sorry, I have not had a chance to file a JIRA ticket.  We have not been
able to resolve the issue.  But since Joel mentioned that upgrading to
Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
currently on Java 1.7 and Cassandra 1.2.8



On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright  wrote:

> You're running 2.0.* in production?  May I ask what C* version and OS?
>  Any hardware details would be appreciated as well.  Thx!
>
> From: Joel Samuelsson 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, February 13, 2014 at 11:39 AM
>
> To: "user@cassandra.apache.org" 
> Subject: Re: Intermittent long application pauses on nodes
>
> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7 seems
> to have helped our issues.
>
>
> 2014-02-13 Keith Wright :
>
>> Frank did you ever file a ticket for this issue or find the root cause?
>>  I believe we are seeing the same issues when attempting to bootstrap.
>>
>> Thanks
>>
>> From: Robert Coli 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Monday, February 3, 2014 at 6:10 PM
>> To: "user@cassandra.apache.org" 
>> Subject: Re: Intermittent long application pauses on nodes
>>
>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>> belliottsm...@datastax.com> wrote:
>>
>>>
>>> It's possible that this is a JVM issue, but if so there may be some
>>> remedial action we can take anyway. There are some more flags we should
>>> add, but we can discuss that once you open a ticket. If you could include
>>> the strange JMX error as well, that might be helpful.
>>>
>>
>> It would be appreciated if you could inform this thread of the JIRA
>> ticket number, for the benefit of the community and google searchers. :)
>>
>> =Rob
>>
>
>


Re: Intermittent long application pauses on nodes

2014-02-13 Thread Keith Wright
You’re running 2.0.* in production?  May I ask what C* version and OS?  Any 
hardware details would be appreciated as well.  Thx!

From: Joel Samuelsson 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 13, 2014 at 11:39 AM
To: "user@cassandra.apache.org" 
Subject: Re: Intermittent long application pauses on nodes

We have had similar issues and upgrading C* to 2.0.x and Java to 1.7 seems to 
have helped our issues.


2014-02-13 Keith Wright :
Frank did you ever file a ticket for this issue or find the root cause?  I 
believe we are seeing the same issues when attempting to bootstrap.

Thanks

From: Robert Coli 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, February 3, 2014 at 6:10 PM
To: "user@cassandra.apache.org" 
Subject: Re: Intermittent long application pauses on nodes

On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

It's possible that this is a JVM issue, but if so there may be some remedial 
action we can take anyway. There are some more flags we should add, but we can 
discuss that once you open a ticket. If you could include the strange JMX error 
as well, that might be helpful.

It would be appreciated if you could inform this thread of the JIRA ticket 
number, for the benefit of the community and google searchers. :)

=Rob



Re: Intermittent long application pauses on nodes

2014-02-13 Thread Joel Samuelsson
We have had similar issues and upgrading C* to 2.0.x and Java to 1.7 seems
to have helped our issues.


2014-02-13 Keith Wright :

> Frank did you ever file a ticket for this issue or find the root cause?  I
> believe we are seeing the same issues when attempting to bootstrap.
>
> Thanks
>
> From: Robert Coli 
> Reply-To: "user@cassandra.apache.org" 
> Date: Monday, February 3, 2014 at 6:10 PM
> To: "user@cassandra.apache.org" 
> Subject: Re: Intermittent long application pauses on nodes
>
> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
> belliottsm...@datastax.com> wrote:
>
>>
>> It's possible that this is a JVM issue, but if so there may be some
>> remedial action we can take anyway. There are some more flags we should
>> add, but we can discuss that once you open a ticket. If you could include
>> the strange JMX error as well, that might be helpful.
>>
>
> It would be appreciated if you could inform this thread of the JIRA ticket
> number, for the benefit of the community and google searchers. :)
>
> =Rob
>


Re: Intermittent long application pauses on nodes

2014-02-12 Thread Keith Wright
Frank did you ever file a ticket for this issue or find the root cause?  I 
believe we are seeing the same issues when attempting to bootstrap.

Thanks

From: Robert Coli 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, February 3, 2014 at 6:10 PM
To: "user@cassandra.apache.org" 
Subject: Re: Intermittent long application pauses on nodes

On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

It's possible that this is a JVM issue, but if so there may be some remedial 
action we can take anyway. There are some more flags we should add, but we can 
discuss that once you open a ticket. If you could include the strange JMX error 
as well, that might be helpful.

It would be appreciated if you could inform this thread of the JIRA ticket 
number, for the benefit of the community and google searchers. :)

=Rob


Re: Intermittent long application pauses on nodes

2014-02-03 Thread Robert Coli
On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

>
> It's possible that this is a JVM issue, but if so there may be some
> remedial action we can take anyway. There are some more flags we should
> add, but we can discuss that once you open a ticket. If you could include
> the strange JMX error as well, that might be helpful.
>

It would be appreciated if you could inform this thread of the JIRA ticket
number, for the benefit of the community and google searchers. :)

=Rob


Re: Intermittent long application pauses on nodes

2014-02-03 Thread Benedict Elliott Smith
Hi Frank,

The "9391" under RevokeBias is the number of milliseconds spent
synchronising on the safepoint prior to the VM operation, i.e. the time it
took to ensure all application threads were stopped. So this is the
culprit. Notice that the time spent spinning/blocking for the threads we
are supposed to be waiting on is very low; it looks to me that this is time
spent waiting for CMS threads to yield, though it is very hard to say with
absolute certainty. It doesn't look like the issue is actually the
RevokeBias itself, anyway.
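As a rough sketch of how to surface such entries, assuming the usual one-record-per-line layout of -XX:+PrintSafepointStatistics output and an illustrative log path, the following prints every safepoint whose sync column (the time, in ms, spent bringing all threads to the safepoint) exceeds one second:

awk -F'[][]' 'NF >= 5 { split($4, t, " "); if (t[3] + 0 > 1000) print }' /var/log/cassandra/stdout.log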

I think we should take this off list. It definitely warrants a ticket,
though I expect this will be difficult to pin down, so you will have to be
willing to experiment a bit with us, but we would be very grateful for the
help. If you can pin down and share a specific workload that triggers this
we may be able to do it without you though!

It's possible that this is a JVM issue, but if so there may be some
remedial action we can take anyway. There are some more flags we should
add, but we can discuss that once you open a ticket. If you could include
the strange JMX error as well, that might be helpful.

Thanks,

Benedict


On 3 February 2014 15:34, Frank Ng  wrote:

> I was able to send SafePointStatistics to another log file via the
> additional JVM flags and recently noticed a pause of 9.3936600 seconds.
> Here are the log entries:
>
> GC Log file:
> ---
> 2014-01-31T07:49:14.755-0500: 137460.842: Total time for which application
> threads were stopped: 0.1095540 seconds
> 2014-01-31T07:51:01.870-0500: 137567.957: Total time for which application
> threads were stopped: 9.3936600 seconds
> 2014-01-31T07:51:02.537-0500: 137568.623: Total time for which application
> threads were stopped: 0.1207440 seconds
>
> JVM Stdout Log File:
> ---
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137460.734: GenCollectForAllocation  [ 421
> 00   ][ 0 0 23  0 84 ]  0
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137558.562: RevokeBias   [ 462
> 29   ][13 0   9391  1  0 ]  0
> 
>  ctxk='javax/management/ObjectName$Property'
> witness='javax/management/ObjectName$PatternProperty' stamp='137568.503'/>
> 
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137568.500: Deoptimize   [ 481
> 15   ][ 0 0118  0  1 ]  0
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137569.625: no vm operation  [ 483
> 01   ][ 0 0 18  0  0 ]  0
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137571.641: no vm operation  [ 483
> 01   ][ 0 0 42  1  0 ]  0
>  vmop [threads: total initially_running
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 137575.703: no vm operation  [ 483
> 01   ][ 0 0 25  1  0 ]  0
>
> If SafepointStatistics are printed before the Application Stop times, then
> it seems that the RevokeBias was the cause of the pause.
> If SafepointStatistics are printed after the Application Stop times, then
> it seems that the Deoptimize was the cause of the pause.
> In addition, I see a strange dependency failed error relating to JMX in
> the JVM stdout log file.
>
> thanks
>
>
> On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith <
> belliottsm...@datastax.com> wrote:
>
>> Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
>> -XX:+LogVMOutput
>>
>> I never figured out what kills stdout for C*. It's a library we depend
>> on, didn't try too hard to figure out which one.
>>
>>
>> On 29 January 2014 21:07, Frank Ng  wrote:
>>
>>> Benedict,
>>> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
>>> However, that info is only sent to the STDOUT console.  The cassandra
>>> startup script closes the STDOUT when it finishes, so nothing is shown for
>>> safepoint statistics once it's done starting up.  Do you know how to
>>> startup cassandra and send all stdout to a log file and tell cassandra not
>>> to close stdout?
>>>
>>> Also, we have swap turned off as recommended.
>>>
>>> thanks
>>>
>>>
>>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
>>> belliottsm...@datastax.com> wrote:
>>>
 Frank,


 The same advice for investigating holds: add the VM 

Re: Intermittent long application pauses on nodes

2014-02-03 Thread Frank Ng
I was able to send SafePointStatistics to another log file via the
additional JVM flags and recently noticed a pause of 9.3936600 seconds.
Here are the log entries:

GC Log file:
---
2014-01-31T07:49:14.755-0500: 137460.842: Total time for which application
threads were stopped: 0.1095540 seconds
2014-01-31T07:51:01.870-0500: 137567.957: Total time for which application
threads were stopped: 9.3936600 seconds
2014-01-31T07:51:02.537-0500: 137568.623: Total time for which application
threads were stopped: 0.1207440 seconds
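(A quick way to surface pauses like the 9.39 s one above, assuming the standard "Total time for which application threads were stopped" line format; the path and one-second threshold are illustrative:

awk '/Total time for which application threads were stopped/ && $(NF-1) + 0 > 1' gc.log)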

JVM Stdout Log File:
---
 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137460.734: GenCollectForAllocation  [ 421
00   ][ 0 0 23  0 84 ]  0
 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137558.562: RevokeBias   [ 462
29   ][13 0   9391  1  0 ]  0



 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137568.500: Deoptimize   [ 481
15   ][ 0 0118  0  1 ]  0
 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137569.625: no vm operation  [ 483
01   ][ 0 0 18  0  0 ]  0
 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137571.641: no vm operation  [ 483
01   ][ 0 0 42  1  0 ]  0
 vmop [threads: total initially_running
wait_to_block][time: spin block sync cleanup vmop] page_trap_count
137575.703: no vm operation  [ 483
01   ][ 0 0 25  1  0 ]  0

If SafepointStatistics are printed before the Application Stop times, then
it seems that the RevokeBias was the cause of the pause.
If SafepointStatistics are printed after the Application Stop times, then
it seems that the Deoptimize was the cause of the pause.
In addition, I see a strange dependency failed error relating to JMX in the
JVM stdout log file.

thanks


On Wed, Jan 29, 2014 at 4:44 PM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

> Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
> -XX:+LogVMOutput
>
> I never figured out what kills stdout for C*. It's a library we depend on,
> didn't try too hard to figure out which one.
>
>
> On 29 January 2014 21:07, Frank Ng  wrote:
>
>> Benedict,
>> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
>> However, that info is only sent to the STDOUT console.  The cassandra
>> startup script closes the STDOUT when it finishes, so nothing is shown for
>> safepoint statistics once it's done starting up.  Do you know how to
>> startup cassandra and send all stdout to a log file and tell cassandra not
>> to close stdout?
>>
>> Also, we have swap turned off as recommended.
>>
>> thanks
>>
>>
>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
>> belliottsm...@datastax.com> wrote:
>>
>>> Frank,
>>>
>>>
>>> The same advice for investigating holds: add the VM flags 
>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1   (you 
>>> could put something above 1 there, to reduce the amount of logging, since a 
>>> pause of 52s will be pretty obvious even if aggregated with lots of other 
>>> safe points; the count is the number of safepoints to aggregate into one 
>>> log message)
>>>
>>>
>>> 52s is a very extreme pause, and I would be surprised if revoke bias could 
>>> cause this. I wonder if the VM is swapping out.
>>>
>>>
>>>
>>> On 29 January 2014 19:02, Frank Ng  wrote:
>>>
 Thanks for the update.  Our logs indicated that there were 0 pending
 for CompactionManager at that time.  Also, there were no nodetool repairs
 running at that time.  The log statements above state that the application
 had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
 the safepoint.


 On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
 shaochuan.w...@bloomreach.com> wrote:

> We had similar latency spikes when pending compactions can't keep up
> or repair/streaming takes too many cycles.
>
>
> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:
>
>> All,
>>
>> We've been having intermittent long application pauses (version
>> 1.2.8) and not sure if it's a cassandra bug.  During these pauses, there
>> are dropped messages in the cassandra log file along with the node seeing
>> other nodes as d

Re: Intermittent long application pauses on nodes

2014-01-30 Thread Sylvain Lebresne
>
>
> I never figured out what kills stdout for C*. It's a library we depend on,
> didn't try too hard to figure out which one.
>

Nah, it's Cassandra itself (in
org.apache.cassandra.service.CassandraDaemon.activate()), but you can pass
-f (for 'foreground') to not do it.
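So, as a minimal sketch (log path illustrative), running the daemon in the foreground and capturing its output yourself also works:

cassandra -f > /var/log/cassandra/stdout.log 2>&1 &   # path illustrative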


>
>
> On 29 January 2014 21:07, Frank Ng  wrote:
>
>> Benedict,
>> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
>> However, that info is only sent to the STDOUT console.  The cassandra
>> startup script closes the STDOUT when it finishes, so nothing is shown for
>> safepoint statistics once it's done starting up.  Do you know how to
>> startup cassandra and send all stdout to a log file and tell cassandra not
>> to close stdout?
>>
>> Also, we have swap turned off as recommended.
>>
>> thanks
>>
>>
>> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
>> belliottsm...@datastax.com> wrote:
>>
>>> Frank,
>>>
>>>
>>> The same advice for investigating holds: add the VM flags 
>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1   (you 
>>> could put something above 1 there, to reduce the amount of logging, since a 
>>> pause of 52s will be pretty obvious even if aggregated with lots of other 
>>> safe points; the count is the number of safepoints to aggregate into one 
>>> log message)
>>>
>>>
>>> 52s is a very extreme pause, and I would be surprised if revoke bias could 
>>> cause this. I wonder if the VM is swapping out.
>>>
>>>
>>>
>>> On 29 January 2014 19:02, Frank Ng  wrote:
>>>
 Thanks for the update.  Our logs indicated that there were 0 pending
 for CompactionManager at that time.  Also, there were no nodetool repairs
 running at that time.  The log statements above state that the application
 had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
 the safepoint.


 On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
 shaochuan.w...@bloomreach.com> wrote:

> We had similar latency spikes when pending compactions can't keep up
> or repair/streaming takes too many cycles.
>
>
> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:
>
>> All,
>>
>> We've been having intermittent long application pauses (version
>> 1.2.8) and not sure if it's a cassandra bug.  During these pauses, there
>> are dropped messages in the cassandra log file along with the node seeing
>> other nodes as down.  We've turned on gc logging and the following is an
>> example of a long "stopped" or pause event in the gc.log file.
>>
>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>> application threads were stopped: 0.091450 seconds
>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>> application threads were stopped: 51.8190260 seconds
>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>> application threads were stopped: 0.005470 seconds
>>
>> As seen above, there was a 0.091450 secs pause, then a 51.8190260
>> secs pause.  There were no GC log events between those 2 log statements.
>> Since there's no GC logs in between, something else must be causing the
>> long stop time to reach a safepoint.
>>
>> Could there be a Cassandra thread that is taking a long time to reach
>> a safepoint and what is it trying to do? Along with the node seeing other
>> nodes as down in the cassandra log file, the StatusLogger shows 1599
>> Pending in ReadStage and 9 Pending in MutationStage.
>>
>> There is mention of cassandra batch revoke bias locks as a possible
>> cause (not GC) via:
>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>
>> We have JNA, no swap, and the cluster runs fine besides these
>> intermittent long pauses that can cause a node to appear down to other
>> nodes.  Any ideas as to the cause of the long pause above? It seems not
>> related to GC.
>>
>> thanks.
>>
>>
>

>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-01-29 Thread Benedict Elliott Smith
Add some more flags: -XX:+UnlockDiagnosticVMOptions -XX:LogFile=${path}
-XX:+LogVMOutput

I never figured out what kills stdout for C*. It's a library we depend on,
didn't try too hard to figure out which one.


On 29 January 2014 21:07, Frank Ng  wrote:

> Benedict,
> Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
> However, that info is only sent to the STDOUT console.  The cassandra
> startup script closes the STDOUT when it finishes, so nothing is shown for
> safepoint statistics once it's done starting up.  Do you know how to
> startup cassandra and send all stdout to a log file and tell cassandra not
> to close stdout?
>
> Also, we have swap turned off as recommended.
>
> thanks
>
>
> On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
> belliottsm...@datastax.com> wrote:
>
>> Frank,
>>
>>
>> The same advice for investigating holds: add the VM flags 
>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1   (you 
>> could put something above 1 there, to reduce the amount of logging, since a 
>> pause of 52s will be pretty obvious even if aggregated with lots of other 
>> safe points; the count is the number of safepoints to aggregate into one log 
>> message)
>>
>>
>> 52s is a very extreme pause, and I would be surprised if revoke bias could 
>> cause this. I wonder if the VM is swapping out.
>>
>>
>>
>> On 29 January 2014 19:02, Frank Ng  wrote:
>>
>>> Thanks for the update.  Our logs indicated that there were 0 pending for
>>> CompactionManager at that time.  Also, there were no nodetool repairs
>>> running at that time.  The log statements above state that the application
>>> had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
>>> the safepoint.
>>>
>>>
>>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
>>> shaochuan.w...@bloomreach.com> wrote:
>>>
 We had similar latency spikes when pending compactions can't keep up
 or repair/streaming takes too many cycles.


 On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:

> All,
>
> We've been having intermittent long application pauses (version 1.2.8)
> and not sure if it's a cassandra bug.  During these pauses, there are
> dropped messages in the cassandra log file along with the node seeing 
> other
> nodes as down.  We've turned on gc logging and the following is an example
> of a long "stopped" or pause event in the gc.log file.
>
> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
> application threads were stopped: 0.091450 seconds
> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
> application threads were stopped: 51.8190260 seconds
> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
> application threads were stopped: 0.005470 seconds
>
> As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs
> pause.  There were no GC log events between those 2 log statements.  Since
> there's no GC logs in between, something else must be causing the long 
> stop
> time to reach a safepoint.
>
> Could there be a Cassandra thread that is taking a long time to reach
> a safepoint and what is it trying to do? Along with the node seeing other
> nodes as down in the cassandra log file, the StatusLogger shows 1599
> Pending in ReadStage and 9 Pending in MutationStage.
>
> There is mention of cassandra batch revoke bias locks as a possible
> cause (not GC) via:
> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>
> We have JNA, no swap, and the cluster runs fine besides these
> intermittent long pauses that can cause a node to appear down to other
> nodes.  Any ideas as to the cause of the long pause above? It seems not
> related to GC.
>
> thanks.
>
>

>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-01-29 Thread Frank Ng
Benedict,
Thanks for the advice.  I've tried turning on PrintSafepointStatistics.
However, that info is only sent to the STDOUT console.  The cassandra
startup script closes the STDOUT when it finishes, so nothing is shown for
safepoint statistics once it's done starting up.  Do you know how to
startup cassandra and send all stdout to a log file and tell cassandra not
to close stdout?

Also, we have swap turned off as recommended.

thanks


On Wed, Jan 29, 2014 at 3:39 PM, Benedict Elliott Smith <
belliottsm...@datastax.com> wrote:

> Frank,
>
>
> The same advice for investigating holds: add the VM flags 
> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1   (you 
> could put something above 1 there, to reduce the amount of logging, since a 
> pause of 52s will be pretty obvious even if aggregated with lots of other 
> safe points; the count is the number of safepoints to aggregate into one log 
> message)
>
>
> 52s is a very extreme pause, and I would be surprised if revoke bias could 
> cause this. I wonder if the VM is swapping out.
>
>
>
> On 29 January 2014 19:02, Frank Ng  wrote:
>
>> Thanks for the update.  Our logs indicated that there were 0 pending for
>> CompactionManager at that time.  Also, there were no nodetool repairs
>> running at that time.  The log statements above state that the application
>> had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
>> the safepoint.
>>
>>
>> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
>> shaochuan.w...@bloomreach.com> wrote:
>>
>>> We had similar latency spikes when pending compactions can't keep up
>>> or repair/streaming takes too many cycles.
>>>
>>>
>>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:
>>>
 All,

 We've been having intermittent long application pauses (version 1.2.8)
 and not sure if it's a cassandra bug.  During these pauses, there are
 dropped messages in the cassandra log file along with the node seeing other
 nodes as down.  We've turned on gc logging and the following is an example
 of a long "stopped" or pause event in the gc.log file.

 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
 application threads were stopped: 0.091450 seconds
 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
 application threads were stopped: 51.8190260 seconds
 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
 application threads were stopped: 0.005470 seconds

 As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs
 pause.  There were no GC log events between those 2 log statements.  Since
 there's no GC logs in between, something else must be causing the long stop
 time to reach a safepoint.

 Could there be a Cassandra thread that is taking a long time to reach a
 safepoint and what is it trying to do? Along with the node seeing other
 nodes as down in the cassandra log file, the StatusLogger shows 1599
 Pending in ReadStage and 9 Pending in MutationStage.

 There is mention of cassandra batch revoke bias locks as a possible
 cause (not GC) via:
 http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html

 We have JNA, no swap, and the cluster runs fine besides these
 intermittent long pauses that can cause a node to appear down to other
 nodes.  Any ideas as to the cause of the long pause above? It seems not
 related to GC.

 thanks.


>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-01-29 Thread Benedict Elliott Smith
Frank,


The same advice for investigating holds: add the VM flags
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
(you could put something above 1 there, to reduce the amount of
logging, since a pause of 52s will be pretty obvious even if
aggregated with lots of other safe points; the count is the number of
safepoints to aggregate into one log message)


52s is a very extreme pause, and I would be surprised if revoke bias
could cause this. I wonder if the VM is swapping out.



On 29 January 2014 19:02, Frank Ng  wrote:

> Thanks for the update.  Our logs indicated that there were 0 pending for
> CompactionManager at that time.  Also, there were no nodetool repairs
> running at that time.  The log statements above state that the application
> had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
> the safepoint.
>
>
> On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
> shaochuan.w...@bloomreach.com> wrote:
>
>> We had similar latency spikes when pending compactions can't keep up
>> or repair/streaming takes too many cycles.
>>
>>
>> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:
>>
>>> All,
>>>
>>> We've been having intermittent long application pauses (version 1.2.8)
>>> and not sure if it's a cassandra bug.  During these pauses, there are
>>> dropped messages in the cassandra log file along with the node seeing other
>>> nodes as down.  We've turned on gc logging and the following is an example
>>> of a long "stopped" or pause event in the gc.log file.
>>>
>>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>>> application threads were stopped: 0.091450 seconds
>>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>>> application threads were stopped: 51.8190260 seconds
>>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>>> application threads were stopped: 0.005470 seconds
>>>
>>> As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs
>>> pause.  There were no GC log events between those 2 log statements.  Since
>>> there's no GC logs in between, something else must be causing the long stop
>>> time to reach a safepoint.
>>>
>>> Could there be a Cassandra thread that is taking a long time to reach a
>>> safepoint and what is it trying to do? Along with the node seeing other
>>> nodes as down in the cassandra log file, the StatusLogger shows 1599
>>> Pending in ReadStage and 9 Pending in MutationStage.
>>>
>>> There is mention of cassandra batch revoke bias locks as a possible
>>> cause (not GC) via:
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>>
>>> We have JNA, no swap, and the cluster runs fine besides these
>>> intermittent long pauses that can cause a node to appear down to other
>>> nodes.  Any ideas as to the cause of the long pause above? It seems not
>>> related to GC.
>>>
>>> thanks.
>>>
>>>
>>
>


Re: Intermittent long application pauses on nodes

2014-01-29 Thread Frank Ng
Thanks for the update.  Our logs indicated that there were 0 pending for
CompactionManager at that time.  Also, there were no nodetool repairs
running at that time.  The log statements above state that the application
had to stop to reach a safepoint.  Yet, it doesn't say what is requesting
the safepoint.


On Wed, Jan 29, 2014 at 1:20 PM, Shao-Chuan Wang <
shaochuan.w...@bloomreach.com> wrote:

> We had similar latency spikes when pending compactions can't keep up or
> repair/streaming takes too many cycles.
>
>
> On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:
>
>> All,
>>
>> We've been having intermittent long application pauses (version 1.2.8)
>> and not sure if it's a cassandra bug.  During these pauses, there are
>> dropped messages in the cassandra log file along with the node seeing other
>> nodes as down.  We've turned on gc logging and the following is an example
>> of a long "stopped" or pause event in the gc.log file.
>>
>> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
>> application threads were stopped: 0.091450 seconds
>> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
>> application threads were stopped: 51.8190260 seconds
>> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
>> application threads were stopped: 0.005470 seconds
>>
>> As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs
>> pause.  There were no GC log events between those 2 log statements.  Since
>> there's no GC logs in between, something else must be causing the long stop
>> time to reach a safepoint.
>>
>> Could there be a Cassandra thread that is taking a long time to reach a
>> safepoint and what is it trying to do? Along with the node seeing other
>> nodes as down in the cassandra log file, the StatusLogger shows 1599
>> Pending in ReadStage and 9 Pending in MutationStage.
>>
>> There is mention of cassandra batch revoke bias locks as a possible cause
>> (not GC) via:
>> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>>
>> We have JNA, no swap, and the cluster runs fine besides these
>> intermittent long pauses that can cause a node to appear down to other
>> nodes.  Any ideas as to the cause of the long pause above? It seems not
>> related to GC.
>>
>> thanks.
>>
>>
>


Re: Intermittent long application pauses on nodes

2014-01-29 Thread Shao-Chuan Wang
We had similar latency spikes when pending compactions can't keep up or
repair/streaming takes too many cycles.


On Wed, Jan 29, 2014 at 10:07 AM, Frank Ng  wrote:

> All,
>
> We've been having intermittent long application pauses (version 1.2.8) and
> not sure if it's a cassandra bug.  During these pauses, there are dropped
> messages in the cassandra log file along with the node seeing other nodes
> as down.  We've turned on gc logging and the following is an example of a
> long "stopped" or pause event in the gc.log file.
>
> 2014-01-28T23:11:12.183-0500: 1337654.424: Total time for which
> application threads were stopped: 0.091450 seconds
> 2014-01-28T23:14:11.161-0500: 1337833.401: Total time for which
> application threads were stopped: 51.8190260 seconds
> 2014-01-28T23:14:19.870-0500: 1337842.111: Total time for which
> application threads were stopped: 0.005470 seconds
>
> As seen above, there was a 0.091450 secs pause, then a 51.8190260 secs
> pause.  There were no GC log events between those 2 log statements.  Since
> there's no GC logs in between, something else must be causing the long stop
> time to reach a safepoint.
>
> Could there be a Cassandra thread that is taking a long time to reach a
> safepoint and what is it trying to do? Along with the node seeing other
> nodes as down in the cassandra log file, the StatusLogger shows 1599
> Pending in ReadStage and 9 Pending in MutationStage.
>
> There is mention of cassandra batch revoke bias locks as a possible cause
> (not GC) via:
> http://www.mail-archive.com/user@cassandra.apache.org/msg34401.html
>
> We have JNA, no swap, and the cluster runs fine besides these intermittent
> long pauses that can cause a node to appear down to other nodes.  Any ideas
> as to the cause of the long pause above? It seems not related to GC.
>
> thanks.
>
>