[ https://issues.apache.org/jira/browse/CASSANDRA-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044660#comment-18044660 ]

Dmitry Konstantinov edited comment on CASSANDRA-20918 at 12/12/25 11:55 AM:
----------------------------------------------------------------------------

+1 from my side as well. [~nitsanw], thank you for spending so much time on this 
long-standing Cassandra pain point.

I've run a simple e2e write test (reusing the one I use for write/flushing 
flow tuning in tickets like CASSANDRA-20226):
 * Table structure:
 ** 5 String value cells
 ** 10 rows per partition
 ** 1-column String partition key
 ** 1-column String clustering key
 * m8i.4xlarge EC2 host
 * 1 compaction thread, with compaction_throughput: 1024MiB/s so that it is not 
a limiting factor
 * 2 cases: -Dcassandra.cursor_compaction_enabled=false/true
 * 2 runs were executed for each case: the 1st is a warmup, results are captured 
for the 2nd
 * the test table is truncated after every run

ttop (./bin/nodetool sjk ttop), vmstat (vmstat -w 1 -t), gc.log and system.log 
were collected:
 * [^compact_before_t1.zip]
 * [^compact_after_t1.zip]

For cursor compaction, an additional 3rd run was executed to capture 
async-profiler profiles (cpu, wall, mem) with DebugNonSafepoints enabled, 
rendered as heatmaps: [^compact_after_t1_profiles.zip] 
I've got about a 2x improvement in compaction throughput in my use case (as 
discussed with [~nitsanw], my workload is VInt-heavy, so the improvement is 
slightly less than expected, but it is still very impressive):
{code:java}
sed -n 's/.* Row Throughput = ~\([0-9][0-9,]*\/s\).*/\1/p' 
compact_before_t1/compact_before_1t_compacted_messages.log 
356,455/s
410,164/s
362,033/s
411,265/s
518,364/s

sed -n 's/.* Row Throughput = ~\([0-9][0-9,]*\/s\).*/\1/p' 
compact_after_t1/compact_after_1t_compacted_messages.log 
646,293/s
646,312/s
608,267/s
574,491/s
627,155/s
820,685/s
976,634/s
{code}
As expected, the heap allocation rate in compaction threads dropped almost to 
zero; in the allocation profile it accounts for less than 0.1% of all 
allocations.

ttop output: 
{code:java}
BEFORE: [000131] user=96.74% sys= 2.65% alloc=  534mb/s - CompactionExecutor:2
AFTER:  [000307] user=34.53% sys= 2.13% alloc=  308kb/s - CompactionExecutor:4
{code}
An interesting side effect: when compaction is faster, we may get a higher 
number of compactions under an intensive write workload, simply because there 
is less time to batch SSTables while a compaction is running :)

Ideas for possible future improvements in other patches (definitely not for the 
current one, which is more than enough and needs to be finalized), based on 
profile analysis:
 * SSTableCursorReader.CellCursor#readCellHeader is noticeable in the profile; 
it is interesting why: is it a skid or a real issue?
 * Cell content copy: a direct copy from a reader to a writer could probably 
squeeze a bit more perf here
 * Varint read/write looks not very cheap; maybe we can save a bit on encoding 
by reusing the encoded value from the reader
 * Allocation (the amounts are small, so I do not expect a big win here)
 ** org.apache.cassandra.io.util.ChecksumWriter#appendDirect - reuse the 
ByteBuffer
 ** org.apache.cassandra.io.util.CompressedChunkReader#getCrcCheckChance - use 
DoubleSupplier to avoid boxing
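
The varint pass-through idea above can be sketched as follows. This is a toy 
LEB128-style codec for illustration only (Cassandra's actual VInt encoding 
differs), and the class/method names are hypothetical; the point is that a 
reader can hand the already-encoded byte span to a writer without a 
decode/re-encode round trip:
{code:java}
import java.io.ByteArrayOutputStream;

public class VarintPassThroughSketch {
    // Toy LEB128-style varint encoder (illustrative, not Cassandra's VInt codec).
    static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80)); // continuation bit set
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Length of the encoded varint starting at off: scan until the
    // continuation bit is clear. No decoding of the value is needed.
    static int encodedLength(byte[] buf, int off) {
        int len = 1;
        while ((buf[off + len - 1] & 0x80) != 0) len++;
        return len;
    }

    public static void main(String[] args) {
        byte[] src = encode(356455L);
        int len = encodedLength(src, 0);
        // Pass-through: the writer appends the already-encoded bytes as-is.
        byte[] copied = java.util.Arrays.copyOfRange(src, 0, len);
        System.out.println(java.util.Arrays.equals(src, copied)); // true
    }
}
{code}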
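
For the boxing point, a minimal sketch of the difference between 
Supplier<Double> and DoubleSupplier (the names here are illustrative, not the 
actual CompressedChunkReader API): the boxed path materializes a 
java.lang.Double wrapper on every call, while the primitive-specialized 
interface avoids any allocation:
{code:java}
import java.util.function.DoubleSupplier;
import java.util.function.Supplier;

public class CrcCheckChanceSketch {
    // Boxed path: each get() goes through a java.lang.Double wrapper object.
    static double readBoxed(Supplier<Double> chance) {
        return chance.get(); // auto-unboxing here
    }

    // Primitive path: getAsDouble() returns a primitive double, no wrapper.
    static double readPrimitive(DoubleSupplier chance) {
        return chance.getAsDouble();
    }

    public static void main(String[] args) {
        double boxed = readBoxed(() -> 0.5);      // lambda result is boxed
        double primitive = readPrimitive(() -> 0.5); // stays primitive
        System.out.println(boxed == primitive); // true
    }
}
{code}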



> Add cursor-based low allocation optimized compaction implementation
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-20918
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20918
>             Project: Apache Cassandra
>          Issue Type: New Feature
>          Components: Local/Compaction, Local/SSTable
>            Reporter: Josh McKenzie
>            Assignee: Nitsan Wakart
>            Priority: Normal
>         Attachments: 7_100m_100kr_100r.png, compact_after_t1.zip, 
> compact_after_t1_profiles.zip, compact_before_t1.zip
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Compaction does a ton of allocation and burns a lot of CPU in the process; we 
> can move away from allocation with some fairly simple and straightforward 
> reusable objects and infrastructure that make use of that, reducing 
> allocation and thus CPU usage during compaction. Heap allocation on all 
> test-cases holds steady at 20MB while regular compaction grows up past 5+GB.
> This patch introduces a collection of reusable objects:
>  * ReusableLivenessInfo
>  * ReusableDecoratedKey
>  * ReusableLongToken
> And new compaction structures that make use of those objects:
>  * CompactionCursor
>  * CursorCompactionPipeline
>  * SSTableCursorReader
>  * SSTableCursorWriter
> There's quite a bit of test code added, benchmarks, etc on the linked branch.
> ~13k added, 405 lines deleted
> ~8.3k lines delta are non-test code
> ~5k lines delta are test code
> Attaching a screenshot of the "messiest" benchmark case with mixed size rows 
> and full merge; across various data and compaction mixes the highlight is 
> that compaction as implemented here is roughly 3-5x faster in most scenarios 
> and uses 20mb on heap vs. multiple GB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
