Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
--------------------------------------------------------------------

                 Key: CASSANDRA-2463
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2463
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.7.4
         Environment: Any
            Reporter: C. Scott Andreas
             Fix For: 0.7.4


Currently, Cassandra 0.7.x allocates a 256MB contiguous byte array at the
beginning of each memtable flush or compaction (the size is presently taken
from Config.in_memory_compaction_limit_in_mb). When several memtable flushes
are triggered at once (as by `nodetool flush` or `nodetool snapshot`), the
tenured generation comes under extreme pressure as the JVM attempts to locate
[n] contiguous 256MB chunks of heap. This often triggers a promotion failure,
resulting in a stop-the-world GC until the allocation can be made. (When the
"release valve" is triggered, the problem is exacerbated further: the release
valve ironically triggers two contiguous 256MB allocations of its own as it
attempts to flush the two largest memtables.)
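
A minimal sketch of the allocation pattern described above (class and constant
names here are hypothetical; this is not Cassandra's flush/compaction code):

```java
// Hypothetical sketch only -- illustrates the allocation pattern described
// above, not Cassandra's actual flush/compaction code.
public class LargeWriteBufferSketch {
    // Mirrors a 256MB write buffer sized by in_memory_compaction_limit_in_mb
    private static final int WRITE_BUFFER_BYTES = 256 * 1024 * 1024;

    public static void main(String[] args) {
        // Simulate several memtable flushes triggered at once, e.g. by
        // `nodetool flush`: each one asks for a contiguous 256MB byte array.
        byte[][] buffers = new byte[4][];
        for (int i = 0; i < buffers.length; i++) {
            // On a 2GB heap, a handful of these requests exhausts the tenured
            // generation; the JVM may hit a promotion failure (or, here, an
            // OutOfMemoryError) before a concurrent collection can free space.
            buffers[i] = new byte[WRITE_BUFFER_BYTES];
            System.out.printf("allocated contiguous buffer %d (%d MB)%n",
                              i, buffers[i].length >> 20);
        }
    }
}
```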

This patch sizes the buffer used by BufferedRandomAccessFile as
Math.min(bytesToWrite, BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) rather
than a hard-coded 256MB. The typical resulting buffer size is 64KB.
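
For illustration, here is a sketch of the sizing rule the patch describes; the
helper and constant names below are chosen for this example rather than copied
from the patch:

```java
// Sketch of the buffer-sizing rule described above; helper and constant
// names are illustrative, not taken verbatim from the patch.
public final class WriteBufferSizing {
    // 64KB, matching the typical resulting buffer size mentioned above
    static final int DEFAULT_BUFFER_SIZE = 64 * 1024;

    /** Cap the write buffer at the smaller of the data size and the default. */
    static int bufferSize(long bytesToWrite) {
        return (int) Math.min(bytesToWrite, DEFAULT_BUFFER_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(bufferSize(10L * 1024));          // 10240  (small flush)
        System.out.println(bufferSize(256L * 1024 * 1024));  // 65536  (large flush)
    }
}
```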

I've measured the impact of this change by comparing the base 0.7.4 release
against the same release with this patch applied. The test involved launching
Cassandra, performing four million writes across three column families from
three clients, and monitoring heap usage and garbage collections. Cassandra
was launched with a 2GB heap and the default JVM options shipped with the
project. The configuration has 7 column families holding a total of 15GB of
data.
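
The graphs linked below were captured with external heap-monitoring tooling
(not reproduced here). As an aside, one minimal way to log comparable heap and
collector numbers from inside the JVM is the standard management beans; this
is only an assumption about how such data could be gathered, not the tool used
for the screenshots:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal heap/GC sampler using the standard JMX beans; not the tool used to
// produce the graphs below, just a way to capture comparable numbers.
public class GcSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used=%dMB committed=%dMB%n",
                              heap.getUsed() >> 20, heap.getCommitted() >> 20);
            for (GarbageCollectorMXBean gc :
                     ManagementFactory.getGarbageCollectorMXBeans()) {
                // With the default options this reports ParNew and CMS counts
                System.out.printf("  %s: collections=%d time=%dms%n",
                                  gc.getName(), gc.getCollectionCount(),
                                  gc.getCollectionTime());
            }
            Thread.sleep(5000);
        }
    }
}
```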

Here's the base 0.7.4 release:
http://cl.ly/413g2K06121z252e2t10

Note that on launch, a flush + compaction is triggered almost immediately,
producing at least seven rapid 256MB allocations (roughly 1.75GB of the 2GB
heap). This maxes out the heap and results in a promotion failure and a full
GC. As flushes proceed, most of them have a corresponding CMS, consistent with
the pattern of a large allocation followed by an immediate collection. A
second promotion failure and full GC occurs around the 75% mark as the
allocations cannot be satisfied without a collection, with several CMS cycles
in between. In the failure cases, the allocation requests arrive so quickly
that a standard CMS phase cannot complete before a ParNew attempts to promote
the surviving byte array into the tenured generation. The heap usage and GC
profile in this graph is very unhealthy.

Here's the 0.7.4 release with this patch applied:
http://cl.ly/050I1g26401B1X0w3s1f

This graph is very different. At launch, rather than an immediate spike to
full allocation and a promotion failure, we see a slow allocation slope
reaching only 1/8th of the total heap. As writes begin, several flushes and
compactions occur, but none result in immediate, large allocations. The ParNew
collector keeps up with collections far more ably, resulting in only one
healthy CMS collection with no promotion failure. Unlike the rapid-allocation,
massive-collection pattern in the first graph, this graph shows a healthy
sawtooth of ParNews and an occasional effective CMS, with no danger of heap
fragmentation causing a promotion failure.

The bottom line is that there's no need to allocate a hard-coded 256MB write
buffer for flushing memtables and compactions to disk. Doing so results in
unhealthy rapid allocation patterns and increases the probability of promotion
failures and full stop-the-world GCs, which can cause nodes to become
unresponsive and be shunned from the ring during flushes and compactions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
