[
https://issues.apache.org/jira/browse/CASSANDRA-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Ellis updated CASSANDRA-2463:
--------------------------------------
Attachment: 2463-v2.txt
I started making it more complicated:
{code}
// the gymnastics here are because
// - we want the buffer large enough that we're not re-buffering when we have
//   to seek back to the start of a row to write the data size.  Here, "10%
//   larger than the average row" is "large enough," meaning we expect to seek
//   and rebuffer about 1/10 of the time.
// - but we don't want to allocate a huge buffer unnecessarily for a small
//   amount of data
// - and on the low end, we don't want to be absurdly stingy with the buffer
//   size for small rows
assert estimatedSize > 0;
long maxBufferSize = Math.min(DatabaseDescriptor.getInMemoryCompactionLimit(), 1024 * 1024);
int bufferSize;
if (estimatedSize < 64 * 1024)
{
    bufferSize = (int) estimatedSize;
}
else
{
    long estimatedRowSize = estimatedSize / keyCount;
    bufferSize = (int) Math.min(Math.max(1.1 * estimatedRowSize, 64 * 1024), maxBufferSize);
}
{code}
... but the larger our buffer is, the larger the penalty for guessing wrong
when we have to seek back and rebuffer.
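For concreteness, here is a standalone sketch of that sizing arithmetic with made-up inputs; the class, the bufferSizeFor wrapper, and the 256MB stand-in for DatabaseDescriptor.getInMemoryCompactionLimit() are all hypothetical, and only the formula itself comes from the snippet above:
{code}
public class BufferSizeSketch
{
    // hypothetical stand-in for DatabaseDescriptor.getInMemoryCompactionLimit()
    private static final long IN_MEMORY_COMPACTION_LIMIT = 256L * 1024 * 1024;

    static int bufferSizeFor(long estimatedSize, long keyCount)
    {
        assert estimatedSize > 0;
        long maxBufferSize = Math.min(IN_MEMORY_COMPACTION_LIMIT, 1024 * 1024);
        if (estimatedSize < 64 * 1024)
            return (int) estimatedSize;
        long estimatedRowSize = estimatedSize / keyCount;
        return (int) Math.min(Math.max(1.1 * estimatedRowSize, 64 * 1024), maxBufferSize);
    }

    public static void main(String[] args)
    {
        // 512MB over 100,000 rows: ~5KB average row, so we land on the 64KB floor
        System.out.println(bufferSizeFor(512L * 1024 * 1024, 100000));  // 65536
        // 512MB over 2,000 rows: ~262KB average row, 1.1x falls between floor and cap
        System.out.println(bufferSizeFor(512L * 1024 * 1024, 2000));    // 295278
        // 1GB over 500 rows: ~2MB average row, so we clamp to the 1MB cap
        System.out.println(bufferSizeFor(1024L * 1024 * 1024, 500));    // 1048576
    }
}
{code}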
Then I went through and added size estimation to the CompactionManager, until I
thought "it's kind of ridiculous to be worrying about saving a few bytes less
than 64KB, especially when we expect most memtables to have more data in them
than 64K when flushed."
Thus, I arrived at the patch Antoine de Saint-Exupery would have written,
attached as v2.
> Flush and Compaction Unnecessarily Allocate 256MB Contiguous Buffers
> --------------------------------------------------------------------
>
> Key: CASSANDRA-2463
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2463
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.7.4
> Environment: Any
> Reporter: C. Scott Andreas
> Labels: patch
> Fix For: 0.7.4
>
> Attachments: 2463-v2.txt, patch.diff
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Currently, Cassandra 0.7.x allocates a 256MB contiguous byte array at the
> beginning of a memtable flush or compaction (presently hard-coded as
> Config.in_memory_compaction_limit_in_mb). When several memtable flushes are
> triggered at once (as by `nodetool flush` or `nodetool snapshot`), the
> tenured generation will typically experience extreme pressure as it attempts
> to locate [n] contiguous 256MB chunks of heap to allocate. This will often
> trigger a promotion failure, resulting in a stop-the-world GC until the
> allocation can be made. (Note that in the case of the "release valve" being
> triggered, the problem is even further exacerbated; the release valve will
> ironically trigger two contiguous 256MB allocations when attempting to flush
> the two largest memtables).
> This patch sets the buffer to be used by BufferedRandomAccessFile to
> Math.min(bytesToWrite, BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE) rather
> than a hard-coded 256MB. The typical resulting buffer size is 64KB.
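> As a reference point, a self-contained toy version of that sizing follows; the
> class and method names and the 64 * 1024 literal are stand-ins for illustration
> (the real constant is BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE):
> {code}
> public class PatchedSizingSketch
> {
>     // stand-in for BufferedRandomAccessFile.DEFAULT_BUFFER_SIZE (64KB per the text above)
>     static final int DEFAULT_BUFFER_SIZE = 64 * 1024;
>
>     static int bufferSizeFor(long bytesToWrite)
>     {
>         // cap at the 64KB default; shrink further only for very small writes
>         return (int) Math.min(bytesToWrite, DEFAULT_BUFFER_SIZE);
>     }
>
>     public static void main(String[] args)
>     {
>         System.out.println(bufferSizeFor(10 * 1024));          // 10240
>         System.out.println(bufferSizeFor(512L * 1024 * 1024)); // 65536, not 256MB
>     }
> }
> {code}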
> I've taken some time to measure the impact of this change on the base 0.7.4
> release and with this patch applied. This test involved launching Cassandra,
> performing four million writes across three column families from three
> clients, and monitoring heap usage and garbage collections. Cassandra was
> launched with 2GB of heap and the default JVM options shipped with the
> project. This configuration has 7 column families with a total of 15GB of
> data.
> Here's the base 0.7.4 release:
> http://cl.ly/413g2K06121z252e2t10
> Note that on launch, we see a flush + compaction triggered almost
> immediately, resulting in at least 7x very quick 256MB allocations maxing out
> the heap and triggering a promotion failure and a full GC. As flushes
> proceed, we see that most of these have a corresponding CMS, consistent with
> the pattern of a large allocation and immediate collection. We see a second
> promotion failure and full GC at the 75% mark as the allocations cannot be
> satisfied without a collection, along with several CMSs in between. In the
> failure cases, the allocation requests occur so quickly that a standard CMS
> phase cannot complete before a ParNew attempts to promote the surviving byte
> array into the tenured generation. The heap usage and GC profile of this
> graph is very unhealthy.
> Here's the 0.7.4 release with this patch applied:
> http://cl.ly/050I1g26401B1X0w3s1f
> This graph is very different. At launch, rather than an immediate spike to
> full allocation and a promotion failure, we see a slow allocation slope
> reaching only 1/8th of total heap size. As writes begin, we see several
> flushes and compactions, but none result in immediate, large allocations. The
> ParNew collector keeps up with collections far more ably, resulting in only
> one healthy CMS collection with no promotion failure. Unlike the unhealthy
> rapid allocation and massive collection pattern we see in the first graph,
> this graph depicts a healthy sawtooth pattern of ParNews and an occasional
> effective CMS with no danger of heap fragmentation resulting in a promotion
> failure.
> The bottom line is that there's no need to allocate a hard-coded 256MB write
> buffer for flushing memtables and compactions to disk. Doing so results in
> unhealthy rapid allocation patterns and increases the probability of
> triggering promotion failures and full stop-the-world GCs which can cause
> nodes to become unresponsive and shunned from the ring during flushes and
> compactions.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira