[ https://issues.apache.org/jira/browse/CASSANDRA-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-2463:
----------------------------------------

    Attachment: patch.diff

Patch attached. Applies cleanly to tag 'cassandra-0.7.4'. All tests pass.

> Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-2463
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2463
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.4
>         Environment: Any
>            Reporter: C. Scott Andreas
>              Labels: patch
>             Fix For: 0.7.4
>
>         Attachments: patch.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, Cassandra 0.7.x allocates a 256MB contiguous byte array at the 
> beginning of a memtable flush or compaction (presently hard-coded as 
> Config.in_memory_compaction_limit_in_mb). When several memtable flushes are 
> triggered at once (as by `nodetool flush` or `nodetool snapshot`), the 
> tenured generation will typically experience extreme pressure as it attempts 
> to locate [n] contiguous 256MB chunks of heap. This will often trigger a 
> promotion failure, resulting in a stop-the-world GC until the allocation can 
> be made. (Note that when the "release valve" is triggered, the problem is 
> exacerbated even further: the release valve ironically triggers two 
> contiguous 256MB allocations when attempting to flush the two largest 
> memtables.)
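>
> To make the allocation pattern concrete, here is a minimal, self-contained 
> sketch (illustrative only; the class and constant names are mine, not the 
> actual flush/compaction writer code) of what happens when several flushes 
> each demand a contiguous 256MB on-heap buffer at once:
>
>     // Sketch only: models N concurrent flushes, each allocating one
>     // contiguous 256MB byte[] (the in_memory_compaction_limit_in_mb default).
>     public class FlushBufferSketch {
>         static final int FLUSH_BUFFER_BYTES = 256 * 1024 * 1024;
>
>         public static void main(String[] args) {
>             int concurrentFlushes = args.length > 0 ? Integer.parseInt(args[0]) : 4;
>             byte[][] buffers = new byte[concurrentFlushes][];
>             for (int i = 0; i < concurrentFlushes; i++) {
>                 // Each array must be satisfied by a single contiguous heap
>                 // region; several at once pressure and fragment the tenured
>                 // generation, inviting promotion failures and full GCs.
>                 buffers[i] = new byte[FLUSH_BUFFER_BYTES];
>             }
>             System.out.println("Allocated " + buffers.length + " x 256MB buffers");
>         }
>     }
>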
> This patch sets the buffer used by BufferedRandomAccessFile to 
> Math.min(bytesToWrite, BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) rather 
> than a hard-coded 256MB. The typical resulting buffer size is 64KB.
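>
> A minimal sketch of the sizing logic (the constant here is a local stand-in 
> assumed to match BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE, roughly 64KB; 
> the real patch feeds the computed size into the writer's 
> BufferedRandomAccessFile):
>
>     // Sketch only: cap the write buffer at the smaller of the data to be
>     // written and the default 64KB buffer size.
>     public class WriteBufferSizing {
>         static final int DEFAULT_BUFFER_SIZE = 64 * 1024; // stand-in constant
>
>         static int writeBufferSize(long bytesToWrite) {
>             return (int) Math.min(bytesToWrite, DEFAULT_BUFFER_SIZE);
>         }
>
>         public static void main(String[] args) {
>             System.out.println(writeBufferSize(10L * 1024));         // 10240: small flush
>             System.out.println(writeBufferSize(500L * 1024 * 1024)); // 65536: capped at 64KB
>         }
>     }
>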
> I've measured the impact of this change by comparing the base 0.7.4 release 
> against the same release with this patch applied. The test involved launching 
> Cassandra, 
> performing four million writes across three column families from three 
> clients, and monitoring heap usage and garbage collections. Cassandra was 
> launched with 2GB of heap and the default JVM options shipped with the 
> project. This configuration has 7 column families with a total of 15GB of 
> data.
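>
> (For reference, GC behavior of this kind can be captured by adding standard 
> HotSpot GC-logging options to the JVM arguments; the flags and log path below 
> are illustrative, not necessarily the exact options used for these graphs:)
>
>     -Xms2G -Xmx2G
>     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>     -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/cassandra/gc.log
>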
> Here's the base 0.7.4 release:
> http://cl.ly/413g2K06121z252e2t10
>
> Note that on launch, we see a flush + compaction triggered almost 
> immediately, producing at least seven rapid 256MB allocations that max out 
> the heap and result in a promotion failure and a full GC. As flushes 
> proceed, we see that most of them have a corresponding CMS, consistent with 
> the pattern of a large allocation followed by an immediate collection. We 
> see a second promotion failure and full GC at the 75% mark as the 
> allocations cannot be satisfied without a collection, along with several 
> CMSs in between. In the failure cases, the allocation requests occur so 
> quickly that a standard CMS phase cannot complete before a ParNew attempts 
> to promote the surviving byte array into the tenured generation. The heap 
> usage and GC profile in this graph are very unhealthy.
>
> Here's the 0.7.4 release with this patch applied:
> http://cl.ly/050I1g26401B1X0w3s1f
>
> This graph is very different. At launch, rather than an immediate spike to 
> full allocation and a promotion failure, we see a slow allocation slope 
> reaching only 1/8th of total heap size. As writes begin, we see several 
> flushes and compactions, but none result in immediate, large allocations. The 
> ParNew collector keeps up with collections far more ably, resulting in only 
> one healthy CMS collection with no promotion failure. Unlike the unhealthy 
> rapid allocation and massive collection pattern we see in the first graph, 
> this graph depicts a healthy sawtooth pattern of ParNews and an occasional 
> effective CMS with no danger of heap fragmentation resulting in a promotion 
> failure.
>
> The bottom line is that there's no need to allocate a hard-coded 256MB write 
> buffer for flushing memtables and compactions to disk. Doing so results in 
> unhealthy rapid allocation patterns and increases the probability of 
> triggering promotion failures and full stop-the-world GCs which can cause 
> nodes to become unresponsive and shunned from the ring during flushes and 
> compactions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
