Help understanding aftermath of death by GC
I moved my site over to Cassandra a few months ago, and everything had been just peachy until a few hours ago (yes, it would be in the middle of the night), when my entire cluster suffered death by GC. By "death by GC" I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 8329887744
INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 8329887744

I'm pretty sure I know what triggered it. When I first started developing against Cassandra, I found the IN clause supremely useful and used it a lot. Later I figured out it was a bad idea, repented, and fixed my code, but I missed one spot. A maintenance task spent a couple of hours repeatedly issuing queries with 1000-item IN clauses, and the whole system went belly up.

I get that my bad queries caused Cassandra to require more heap than was available, but here's what I don't understand. When the crap hit the fan, the maintenance task died with a timeout error, but the cluster never recovered. I would have expected that once I was no longer issuing the bad queries, the heap would get cleaned up and life would return to normal. Can anybody help me understand why Cassandra wouldn't recover? How can GC pressure leave the heap permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don't really have control over spikes. If memory pressure spikes, I can tolerate some timeouts and errors, but if the cluster can't come back when the pressure is gone, that seems pretty bad.

Any insights would be greatly appreciated.

Robert
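For what it's worth, the usual fix for the pattern described above is to replace one big multi-partition IN query with many small single-partition queries issued concurrently from the client, so no single coordinator has to buffer a thousand partitions on its heap. A minimal sketch of that fan-out, with `fetchOne` standing in for a hypothetical single-partition read issued through your driver (the names here are illustrative, not actual driver API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutInsteadOfIn {
    // Hypothetical stand-in for a single-partition read such as
    // "SELECT * FROM items WHERE id = ?" executed through a driver.
    static String fetchOne(int id) {
        return "row-" + id;
    }

    // Instead of one "SELECT ... WHERE id IN (<1000 ids>)", fan out one
    // small query per key with bounded client-side concurrency. Each query
    // hits only the replicas owning that key, and no coordinator ever holds
    // more than one partition's worth of results at a time.
    static List<String> fetchAll(List<Integer> ids, int concurrency) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (int id : ids) {
                futures.add(CompletableFuture.supplyAsync(() -> fetchOne(id), pool));
            }
            List<String> rows = new ArrayList<>();
            for (CompletableFuture<String> f : futures) {
                rows.add(f.get());
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) ids.add(i);
        List<String> rows = fetchAll(ids, 32);
        System.out.println(rows.size()); // 1000
    }
}
```

With a real driver you would use its async execute call instead of a thread pool, but the shape (and the bounded concurrency, so the fan-out doesn't itself become a spike) is the same.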
Re: Help understanding aftermath of death by GC
Hi Robert,

On Tue, Mar 31, 2015 at 2:22 PM, Robert Wille rwi...@fold3.com wrote:
> Can anybody help me understand why Cassandra wouldn't recover?

One issue when a JVM starts running out of memory is that it can throw `OutOfMemoryError` in any thread, not necessarily the thread that is consuming all the memory. I've seen this happen multiple times. If this happened to you, a critical Cassandra thread could have died and brought the whole Cassandra DB down with it.

Just an idea. Cheers,
Jens

--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/#!/tink.se
Twitter https://twitter.com/tink
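Jens's point, that `OutOfMemoryError` lands in whichever thread happened to request the failing allocation, can be demonstrated in plain Java. In this sketch a worker thread requests an allocation the JVM cannot satisfy, dies, and the rest of the process carries on as if nothing happened (the thread name and messages are just for illustration):

```java
public class OomKillsAnyThread {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // An allocation the JVM cannot satisfy throws OutOfMemoryError
            // in whichever thread requested it -- not necessarily the
            // thread responsible for the overall heap pressure.
            long[] huge = new long[Integer.MAX_VALUE];
            System.out.println(huge.length); // never reached
        }, "critical-worker");
        worker.setUncaughtExceptionHandler((t, e) ->
            System.out.println(t.getName() + " died: " + e.getClass().getSimpleName()));
        worker.start();
        worker.join();
        // The process keeps running, but whatever the dead thread was
        // responsible for (flushing, compaction, gossip, ...) silently
        // stops happening -- which matches a cluster that never recovers.
        System.out.println("main thread still alive");
    }
}
```

In a real OOM situation it is even worse: the error is thrown in whichever thread's allocation pushed the heap over the edge, so which thread dies is effectively random.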
Re: Help understanding aftermath of death by GC
On Tue, Mar 31, 2015 at 9:12 AM, Jens Rantil jens.ran...@tink.se wrote:
> One issue when you are running a JVM and start running out of memory is that the JVM can start throwing `OutOfMemoryError` in any thread - not necessarily in the thread which is taking all the memory. I've seen this happen multiple times. If this happened to you, a critical Cassandra thread could have died and brought the whole Cassandra DB with itself.

Jens is correct that the JVM has few options as to what to do when it runs out of heap. https://issues.apache.org/jira/browse/CASSANDRA-7507 expands a bit on the Cassandra-specific behavior here. But basically, once you've OOMed (any generation of) the heap, you almost certainly want to stop or restart the JVM, even if it hasn't crashed itself.

=Rob
Re: Help understanding aftermath of death by GC
Hey Robert,

You might want to start by looking into Cassandra's statistics, either exposed via nodetool or, if you have a monitoring system, by watching the important metrics there. I read this article a moment ago and hope it helps you begin to understand where and how to determine the root cause: http://aryanet.com/blog/cassandra-garbage-collector-tuning

jason

On Tue, Mar 31, 2015 at 8:22 PM, Robert Wille rwi...@fold3.com wrote:
> I moved my site over to Cassandra a few months ago, and everything has been just peachy until a few hours ago (yes, it would be in the middle of the night) when my entire cluster suffered death by GC.
> [snip]
> Can anybody help me understand why Cassandra wouldn't recover?
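As a starting point for the metrics Jason mentions: the numbers GCInspector logs (and that nodetool and most monitoring agents report) are exposed over JMX, and the GC and heap figures can be read with the standard `java.lang.management` beans. A minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcStats {
    public static void main(String[] args) {
        // Per-collector counts and cumulative pause time. On a CMS-configured
        // Cassandra node you would see ParNew and ConcurrentMarkSweep here;
        // a rising ConcurrentMarkSweep count with long times is the same
        // signal as the GCInspector lines in system.log.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + ": collections=" + gc.getCollectionCount()
                + " totalTimeMs=" + gc.getCollectionTime());
        }
        // Current heap occupancy vs. the configured maximum -- the
        // "used; max is" pair from the GCInspector log lines.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used=" + heap.getUsed() + " max=" + heap.getMax());
    }
}
```

Pointing a monitoring agent (or a remote JMX console) at the same beans on each node gives you the trend over time, which is what you need to catch "heap stays near max after every CMS cycle" before the node goes into a GC death spiral.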