[ https://issues.apache.org/jira/browse/CASSANDRA-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2463:
--------------------------------------

    Attachment: 2463-v2.txt

I started making it more complicated:

{code}
        // the gymnastics here are because
        //  - we want the buffer large enough that we're not re-buffering when we have to seek back to the
        //    start of a row to write the data size.  Here, "10% larger than the average row" is "large enough,"
        //    meaning we expect to seek and rebuffer about 1/10 of the time.
        //  - but we don't want to allocate a huge buffer unnecessarily for a small amount of data
        //  - and on the low end, we don't want to be absurdly stingy with the buffer size for small rows
        assert estimatedSize > 0;
        long maxBufferSize = Math.min(DatabaseDescriptor.getInMemoryCompactionLimit(), 1024 * 1024);
        int bufferSize;
        if (estimatedSize < 64 * 1024)
        {
            bufferSize = (int) estimatedSize;
        }
        else
        {
            long estimatedRowSize = estimatedSize / keyCount;
            bufferSize = (int) Math.min(Math.max(1.1 * estimatedRowSize, 64 * 1024), maxBufferSize);
        }
{code}

...  but the larger our buffer is, the larger the penalty for guessing wrong 
when we have to seek back and rebuffer.

Then I went through and added size estimation to the CompactionManager, until I 
thought "it's kind of ridiculous to be worrying about saving a few bytes less 
than 64KB, especially when we expect most memtables to have more data in them 
than 64KB when flushed."

Thus, I arrived at the patch Antoine de Saint-Exupery would have written, 
attached as v2.
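The v2 diff itself isn't quoted here, but based on the reasoning above and the original report's Math.min(bytesToWrite, BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) approach quoted below, the simplification presumably amounts to something like the following sketch. The names and constants are stand-ins, not the literal patch.

{code}
public class SimpleBufferSizing
{
    // Stand-in for BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE (64KB).
    private static final int DEFAULT_BUFFER_SIZE = 64 * 1024;

    // Cap the write buffer at the default size instead of estimating per-row
    // sizes: small flushes get a buffer no larger than the data they write,
    // and large flushes simply rebuffer as they go.
    public static int bufferSize(long bytesToWrite)
    {
        assert bytesToWrite > 0;
        return (int) Math.min(bytesToWrite, DEFAULT_BUFFER_SIZE);
    }
}
{code}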

> Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-2463
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2463
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.4
>         Environment: Any
>            Reporter: C. Scott Andreas
>              Labels: patch
>             Fix For: 0.7.4
>
>         Attachments: 2463-v2.txt, patch.diff
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, Cassandra 0.7.x allocates a 256MB contiguous byte array at the 
> beginning of a memtable flush or compaction (presently hard-coded as 
> Config.in_memory_compaction_limit_in_mb). When several memtable flushes are 
> triggered at once (as by `nodetool flush` or `nodetool snapshot`), the 
> tenured generation will typically experience extreme pressure as it attempts 
> to locate [n] contiguous 256MB chunks of heap to allocate. This will often 
> trigger a promotion failure, resulting in a stop-the-world GC until the 
> allocation can be made. (Note that in the case of the "release valve" being 
> triggered, the problem is even further exacerbated; the release valve will 
> ironically trigger two contiguous 256MB allocations when attempting to flush 
> the two largest memtables).
>
> This patch sets the buffer to be used by BufferedRandomAccessFile to 
> Math.min(bytesToWrite, BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) rather 
> than a hard-coded 256MB. The typical resulting buffer size is 64KB.
>
> I've taken some time to measure the impact of this change on the base 0.7.4 
> release and with this patch applied. This test involved launching Cassandra, 
> performing four million writes across three column families from three 
> clients, and monitoring heap usage and garbage collections. Cassandra was 
> launched with 2GB of heap and the default JVM options shipped with the 
> project. This configuration has 7 column families with a total of 15GB of 
> data.
>
> Here's the base 0.7.4 release:
> http://cl.ly/413g2K06121z252e2t10
>
> Note that on launch, we see a flush + compaction triggered almost 
> immediately, resulting in at least 7 very quick 256MB allocations that max 
> out the heap, causing a promotion failure and a full GC. As flushes 
> proceed, we see that most of these have a corresponding CMS, consistent with 
> the pattern of a large allocation and immediate collection. We see a second 
> promotion failure and full GC at the 75% mark as the allocations cannot be 
> satisfied without a collection, along with several CMSs in between. In the 
> failure cases, the allocation requests occur so quickly that a standard CMS 
> phase cannot complete before a ParNew attempts to promote the surviving byte 
> array into the tenured generation. The heap usage and GC profile of this 
> graph is very unhealthy.
>
> Here's the 0.7.4 release with this patch applied:
> http://cl.ly/050I1g26401B1X0w3s1f
>
> This graph is very different. At launch, rather than an immediate spike to 
> full allocation and a promotion failure, we see a slow allocation slope 
> reaching only 1/8th of total heap size. As writes begin, we see several 
> flushes and compactions, but none result in immediate, large allocations. The 
> ParNew collector keeps up with collections far more ably, resulting in only 
> one healthy CMS collection with no promotion failure. Unlike the unhealthy 
> rapid allocation and massive collection pattern we see in the first graph, 
> this graph depicts a healthy sawtooth pattern of ParNews and an occasional 
> effective CMS with no danger of heap fragmentation resulting in a promotion 
> failure.
>
> The bottom line is that there's no need to allocate a hard-coded 256MB write 
> buffer for flushing memtables and compactions to disk. Doing so results in 
> unhealthy rapid allocation patterns and increases the probability of 
> triggering promotion failures and full stop-the-world GCs which can cause 
> nodes to become unresponsive and shunned from the ring during flushes and 
> compactions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
