amorton created CASSANDRA-10766:
-----------------------------------

             Summary: OOM from 22000 memtables 
                 Key: CASSANDRA-10766
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10766
             Project: Cassandra
          Issue Type: Bug
         Environment: * 32 cores, physical machines
* 8 SSD drives in raid 0
* HotSpot Java 8u60
* Linux version 3.8.13-98.el6uek.x86_64 (mockbuild@x86-ol6-builder-04) (gcc 
version 4.4.7 20120313 (Red Hat 4.4.7-11))
            Reporter: amorton
         Attachments: MemtableFlushWriter.txt, SlabPoolCleaner.txt, 
lots-of-memtables.png, thread-dump-flush.txt, thread-dump-full.txt

Hi folks, we observed this on a new Cassandra 2.1.11 cluster.

The node ran out of memory, and when it did it had 22,000+ memtables in memory 
with a corresponding number of pending MemtableFlushWriter and 
MemtablePostFlush tasks. We have seen other nodes go OOM, but have not 
looked at them as deeply as this one.

We also noticed that, while 8 flush writers were configured, there were times 
when only one (or maybe two) were making progress, and times when the 
SlabPoolCleaner was flushing memtables 50 times a second.

I'll try to include as much useful info below as I can. I have the full system 
logs and a heap dump and can go through them with someone if needed. Please 
ping me on IRC, or at the usual email address if I am not online.

The attached lots-of-memtables.png shows part of the heap dump with the huge 
number of memtables in RAM.

I have also attached copies of the same thread dump:

* thread-dump-full.txt is the full dump; note that it includes approximately 
300 shared worker threads. 
* thread-dump-flush.txt contains just the MemtableFlushWriter and 
MemtablePostFlush threads. 

Note that all the MemtableFlushWriter threads are trying to update the 
DataTracker, and that the MemtablePostFlush thread is waiting for its countdown 
latch to be counted down. The PostFlush object in the MemtablePostFlush thread 
is associated with the MemtableFlushWriter:744 thread.
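
For anyone not staring at the dump, here is a minimal, hypothetical sketch (plain JDK primitives, not the actual Cassandra code) of the flush / post-flush coordination the thread dump suggests: the post-flush task parks on a latch until the corresponding flush writer counts it down, so a stalled flush writer pins everything queued behind the post-flush.

{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the actual Cassandra source: the post-flush
// task blocks on a latch that the flush-writer task counts down once the
// memtable is written and the tracker updated. If the flush writer stalls
// (e.g. spinning on the DataTracker CAS), the post-flush task stays parked
// on await() and everything queued behind it backs up.
public class FlushLatchExample
{
    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService flushWriter = Executors.newSingleThreadExecutor(); // MemtableFlushWriter analogue
        ExecutorService postFlush = Executors.newSingleThreadExecutor();   // MemtablePostFlush analogue

        CountDownLatch flushed = new CountDownLatch(1);

        // Flush writer: write the sstable, update the tracker, then release post-flush.
        flushWriter.submit(() -> {
            // ... flush work would happen here ...
            flushed.countDown();
        });

        // Post flush: cannot proceed until the flush completes; this is where
        // the MemtablePostFlush thread in the dump is parked.
        postFlush.submit(() -> {
            try
            {
                flushed.await();
                // ... commit log segments would be discarded here ...
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });

        flushWriter.shutdown();
        postFlush.shutdown();
        postFlush.awaitTermination(10, TimeUnit.SECONDS);
    }
}
{noformat}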

Looking at the MemtableFlushWriter threads, they are trying to flush the 
following tables:

{noformat}
java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:750" 266120 120
system.hints

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:744" 223864 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:752" 220848 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:754" 219736 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:749" 214336 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:753" 211632 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:748" 149616 120
system.sstable_activity

java.lang.Thread [Thread, Stack Local] "MemtableFlushWriter:751" 143480 120
system.sstable_activity
{noformat}

When I saw that these were nearly all on the same table, I wondered whether the 
CAS in {{DataTracker.replaceFlushed()}} was starving some threads.
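
To illustrate the theory (this is a hypothetical sketch, not the real {{DataTracker}} code): if {{replaceFlushed()}} updates an atomic view reference via a compare-and-swap retry loop, a thread whose CAS keeps losing has to rebuild the new view and retry, so under heavy contention on one table a couple of threads can keep winning while the others spin.

{noformat}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only (not the actual DataTracker source): the suspected
// pattern is an AtomicReference<View> updated via a compareAndSet retry loop.
// A thread whose CAS keeps losing must recompute the new view and retry, so
// under heavy contention some flush writers can be starved while others win.
public class CasRetryExample
{
    // Stand-in for DataTracker's view; the real View is far more complex.
    static final class View
    {
        final int flushedCount;
        View(int flushedCount) { this.flushedCount = flushedCount; }
    }

    private final AtomicReference<View> view = new AtomicReference<>(new View(0));

    // Shape of the suspected loop: read, build a new view, CAS, retry on failure.
    public void replaceFlushed()
    {
        while (true)
        {
            View current = view.get();
            View updated = new View(current.flushedCount + 1); // rebuilding the view is the expensive part
            if (view.compareAndSet(current, updated))
                return;
            // CAS lost: another flush writer changed the view first; loop and try again.
        }
    }
}
{noformat}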

The attached SlabPoolCleaner.txt contains the log lines from the SlabPoolCleaner 
thread, taken from a couple of system.log files. Notes:

* The machine went OOM at 08:38:25,463 server time. 
* The server was flushing user tables of between 100 and 200 MB until approx 
2015-11-22 07:40:39,669 server time.
* After that, the server was flushing files that were only tens of bytes in size. 


The attached MemtableFlushWriter.txt includes log lines from the 
MemtableFlushWriter threads. Notes:

* From approx 2015-11-22 07:41:18 it often looks like only one 
MemtableFlushWriter thread makes progress at a time. 

A few other notes:

* This machine has logged a few dozen LEAK DETECTED messages. 
* It has also logged messages about corrupted SSTables. 

At this point I am unsure whether the SlabPoolCleaner was correctly flushing in 
reaction to low memory, or was incorrectly flushing so much that it overloaded 
the flush system. We are also unsure what role the CAS in 
{{DataTracker.replaceFlushed()}} played; all the MemtableFlushWriter threads 
being stuck at the same point of progress seems odd.

As a test of the CAS theory we ran the node with {{memtable_flush_writers}} 
set to 2; it has been running for approx 12 hours under load and has not OOM'd 
yet. This is a good sign. 
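
For reference, the only change for that test was the flush writer count in cassandra.yaml:

{noformat}
# cassandra.yaml -- reduced from the original 8 for this test
memtable_flush_writers: 2
{noformat}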

Any thoughts on what could be happening?


