Help understanding aftermath of death by GC

2015-03-31 Thread Robert Wille
I moved my site over to Cassandra a few months ago, and everything has been 
just peachy until a few hours ago (yes, it would be in the middle of the night) 
when my entire cluster suffered death by GC. By death by GC, I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
 INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 
8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC 
for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 
8329887744

I’m pretty sure I know what triggered it. When I first started developing 
against Cassandra, I found the IN clause to be supremely useful, and I used it 
a lot. Later I figured out it was a bad thing, repented, and fixed my code, but 
I missed one spot. A maintenance task spent a couple of hours repeatedly 
issuing queries with IN clauses containing 1000 items each, and the whole 
system went belly up.

I get that my bad queries caused Cassandra to require more heap than was 
available, but here’s what I don’t understand. When the crap hit the fan, the 
maintenance task died due to a timeout error, but the cluster never recovered. 
I would have expected that once I was no longer issuing the bad queries, the 
heap would get cleaned up and life would return to normal. Can anybody help me 
understand why Cassandra wouldn’t recover? How is it that GC pressure can leave 
heap permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don’t really have 
control over spikes. If memory pressure spikes, I can tolerate some timeouts 
and errors, but if it can’t come back when the pressure is gone, that seems 
pretty bad.

Any insights would be greatly appreciated.

Robert




Re: Help understanding aftermath of death by GC

2015-03-31 Thread Jens Rantil
Hi Robert,

On Tue, Mar 31, 2015 at 2:22 PM, Robert Wille rwi...@fold3.com wrote:

 Can anybody help me understand why Cassandra wouldn’t recover?


One issue when you are running a JVM and start running out of memory is
that the JVM can start throwing `OutOfMemoryError` in any thread - not
necessarily in the thread that is consuming all the memory. I've seen this
happen multiple times. If this happened to you, a critical Cassandra thread
could have died and taken the whole Cassandra DB down with it.
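
A contrived sketch of what I mean, in plain Java rather than anything
Cassandra-specific (run it with a small heap such as -Xmx64m; which thread
actually gets the error is nondeterministic):

import java.util.ArrayList;
import java.util.List;

public class OomAnyThread {
    static final List<byte[]> retained = new ArrayList<byte[]>();
    static volatile byte[] sink;

    public static void main(String[] args) throws InterruptedException {
        // This thread retains most of the heap and never lets go of it.
        Thread hog = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    retained.add(new byte[1024 * 1024]);
                }
            }
        }, "memory-hog");

        // This thread only makes tiny, short-lived allocations, yet it can be
        // the one that receives the OutOfMemoryError once the heap is exhausted.
        Thread innocent = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    sink = new byte[1024];
                }
            }
        }, "innocent-worker");

        hog.start();
        innocent.start();
        hog.join();
        innocent.join();
    }
}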

Just an idea - cheers,
Jens

-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: Help understanding aftermath of death by GC

2015-03-31 Thread Robert Coli
On Tue, Mar 31, 2015 at 9:12 AM, Jens Rantil jens.ran...@tink.se wrote:

 One issue when you are running a JVM and start running out of memory is
 that the JVM can start throwing `OutOfMemoryError` in any thread - not
 necessarily in the thread which is taking all the memory. I've seen this
 happen multiple times. If this happened to you, a critical Cassandra thread
 could have died and brought the whole Cassandra DB with itself.


Jens is correct that the JVM has few options as to what to do when it runs
out of heap. CASSANDRA-7507 expands a bit on the Cassandra-specific behavior
here:

https://issues.apache.org/jira/browse/CASSANDRA-7507

But basically, once you've OOMed (any generation of) the heap, you almost
certainly want to stop or restart the JVM, even if it hasn't crashed on its
own.
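
A generic illustration of the "die instead of limping along" idea in plain
Java - this is not Cassandra's actual implementation, and production setups
often get the same effect with JVM flags such as -XX:OnOutOfMemoryError:

public class DieOnOom {
    public static void main(String[] args) {
        // Last-resort handler: if an OutOfMemoryError escapes any thread,
        // halt the whole JVM immediately rather than keep running in an
        // unreliable state. (OOMs swallowed by catch blocks never get here.)
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                if (e instanceof OutOfMemoryError) {
                    System.err.println("OOM in thread " + t.getName() + ", halting JVM");
                    Runtime.getRuntime().halt(1); // skips shutdown hooks on purpose
                }
            }
        });
        // ... start application threads here ...
    }
}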

=Rob


Re: Help understanding aftermath of death by GC

2015-03-31 Thread Jason Wee
Hey Robert, you might want to start by looking into the statistics of
Cassandra, either exposed via nodetool or, if you have a monitoring system,
via the important metrics it collects. I read this article a moment ago and
I hope it helps you begin to understand where and how to determine the root
cause: http://aryanet.com/blog/cassandra-garbage-collector-tuning
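
If you do not have a monitoring system yet, a rough sketch of polling the
relevant JVM numbers over JMX (the host name below is made up, 7199 is the
default Cassandra JMX port, and remote JMX access has to be enabled on the
node; nodetool reads the same MBeans):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPoll {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cas031.example.com:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();

        // Per-collector totals (e.g. ConcurrentMarkSweep), the same kind of
        // figures GCInspector writes to system.log.
        for (GarbageCollectorMXBean gc : ManagementFactory.getPlatformMXBeans(
                mbsc, GarbageCollectorMXBean.class)) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }

        // Current heap usage versus the configured maximum.
        MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        System.out.println("heap used: " + mem.getHeapMemoryUsage().getUsed()
                + " / max " + mem.getHeapMemoryUsage().getMax());

        connector.close();
    }
}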

jason
