[
https://issues.apache.org/jira/browse/CASSANDRA-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044660#comment-18044660
]
Dmitry Konstantinov edited comment on CASSANDRA-20918 at 12/12/25 11:55 AM:
----------------------------------------------------------------------------
+1 from my side as well, [~nitsanw], thank you for spending so much time dealing with
this old Cassandra pain point.
I've run a simple e2e write test (reusing the one I use for write/flushing
flow tuning in tickets like CASSANDRA-20226):
* Table structure:
** 5 String value cells
** 10 rows per partition
** 1-column String partition key
** 1-column String clustering key
* m8i.4xlarge EC2 host
* 1 compaction thread, with compaction_throughput: 1024MiB/s so that it is not a
limiting factor
* 2 cases: -Dcassandra.cursor_compaction_enabled=false/true
* 2 runs were executed for each case: the 1st is a warmup, results are captured for
the 2nd
* The test table is truncated after every run
ttop (./bin/nodetool sjk ttop), vmstat (vmstat -w 1 -t), gc.log and system.log
are collected:
* [^compact_before_t1.zip]
* [^compact_after_t1.zip]
For the cursor_compaction case an additional 3rd run was executed to capture
async-profiler profiles (cpu, wall, mem) with DebugNonSafepoints enabled, as heatmaps:
[^compact_after_t1_profiles.zip]
I've got about a 2x improvement in compaction throughput for my use case (we
discussed it with [~nitsanw]: my workload is VInt heavy, so the improvement
is slightly smaller than expected, but it is still very impressive):
{code:java}
sed -n 's/.* Row Throughput = ~\([0-9][0-9,]*\/s\).*/\1/p' compact_before_t1/compact_before_1t_compacted_messages.log
356,455/s
410,164/s
362,033/s
411,265/s
518,364/s
sed -n 's/.* Row Throughput = ~\([0-9][0-9,]*\/s\).*/\1/p' compact_after_t1/compact_after_1t_compacted_messages.log
646,293/s
646,312/s
608,267/s
574,491/s
627,155/s
820,685/s
976,634/s
{code}
As expected, the heap allocation rate in compaction threads has dropped almost
to 0, and in the allocation profile it accounts for less than 0.1% of all allocations.
ttop output:
{code:java}
BEFORE: [000131] user=96.74% sys= 2.65% alloc= 534mb/s - CompactionExecutor:2
AFTER: [000307] user=34.53% sys= 2.13% alloc= 308kb/s - CompactionExecutor:4
{code}
An interesting side effect: when compaction is faster we may get a higher
number of compactions under an intensive write workload, simply because there is
less time to batch SSTables while a compaction is running :)
Ideas for possible future improvements in other patches (definitely not for the
current one, which already does more than enough and which we need to finalize),
based on profile analysis:
* SSTableCursorReader.CellCursor#readCellHeader is noticeable in the profile; it
would be interesting to understand why: is it a skid or a real hotspot?
* Cell content copy: a direct copy from the reader to the writer could probably
squeeze a bit more performance here (a rough sketch follows after this list)
* It looks like varint read/write is not very cheap; maybe we can save a bit
on encoding by reusing the already-encoded value from the reader
* Allocation (the amounts are small, so I do not expect a big win here; also sketched below):
** org.apache.cassandra.io.util.ChecksumWriter#appendDirect - reuse ByteBuffer
** org.apache.cassandra.io.util.CompressedChunkReader#getCrcCheckChance - use DoubleSupplier to avoid boxing
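To make the direct-copy idea above a bit more concrete, here is a minimal sketch; it is not the actual SSTableCursorReader/SSTableCursorWriter API, and the DirectCellCopy class, the buffer and the channel parameters are hypothetical names used only for illustration:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class DirectCellCopy
{
    /**
     * Copies {@code length} already-encoded bytes starting at {@code offset} from the
     * reader's buffer straight to the writer's channel, without decoding the cell value
     * into an intermediate object and re-serializing it.
     */
    static void copyEncodedValue(ByteBuffer readerBuffer, int offset, int length,
                                 WritableByteChannel writerChannel) throws IOException
    {
        ByteBuffer slice = readerBuffer.duplicate(); // leave the reader's position/limit untouched
        slice.limit(offset + length);
        slice.position(offset);
        while (slice.hasRemaining())
            writerChannel.write(slice);              // raw encoded bytes, no per-cell allocation
    }
}
{code}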
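And a rough illustration of the two allocation points, assuming the current code allocates a fresh checksum buffer per call and reads the CRC check chance through a boxed Double; the AllocationSketches class and its fields are hypothetical, not the real ChecksumWriter/CompressedChunkReader code:
{code:java}
import java.nio.ByteBuffer;
import java.util.function.DoubleSupplier;
import java.util.zip.CRC32;

final class AllocationSketches
{
    // ChecksumWriter#appendDirect idea: keep one small buffer per writer instance
    // and clear() it before each append instead of allocating a new one per call.
    private final ByteBuffer checksumBytes = ByteBuffer.allocate(Integer.BYTES);
    private final CRC32 crc = new CRC32();

    ByteBuffer checksumFor(ByteBuffer chunk)
    {
        crc.reset();
        crc.update(chunk.duplicate()); // duplicate so the caller's position is unchanged
        checksumBytes.clear();
        checksumBytes.putInt((int) crc.getValue());
        checksumBytes.flip();
        return checksumBytes;          // reused buffer, no per-call allocation
    }

    // getCrcCheckChance idea: a primitive DoubleSupplier avoids boxing a Double
    // on every read, unlike a Supplier<Double>.
    private final DoubleSupplier crcCheckChance = () -> 1.0; // placeholder for wherever the setting lives

    boolean shouldVerifyChecksum(double random)
    {
        return random < crcCheckChance.getAsDouble(); // primitive path, no Double boxing
    }
}
{code}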
> Add cursor-based low allocation optimized compaction implementation
> -------------------------------------------------------------------
>
> Key: CASSANDRA-20918
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20918
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Local/Compaction, Local/SSTable
> Reporter: Josh McKenzie
> Assignee: Nitsan Wakart
> Priority: Normal
> Attachments: 7_100m_100kr_100r.png, compact_after_t1.zip,
> compact_after_t1_profiles.zip, compact_before_t1.zip
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> Compaction does a ton of allocation and burns a lot of CPU in the process; we
> can move away from allocation with some fairly simple and straightforward
> reusable objects and infrastructure that make use of that, reducing
> allocation and thus CPU usage during compaction. Heap allocation on all
> test-cases holds steady at 20MB while regular compaction grows to more than 5GB.
> This patch introduces a collection of reusable objects:
> * ReusableLivenessInfo
> * ReusableDecoratedKey
> * ReusableLongToken
> And new compaction structures that make use of those objects:
> * CompactionCursor
> * CursorCompactionPipeline
> * SSTableCursorReader
> * SSTableCursorWriter
> There's quite a bit of test code added, benchmarks, etc on the linked branch.
> ~13k added, 405 lines deleted
> ~8.3k lines delta are non-test code
> ~5k lines delta are test code
> Attaching a screenshot of the "messiest" benchmark case with mixed size rows
> and full merge; across various data and compaction mixes the highlight is
> that compaction as implemented here is roughly 3-5x faster in most scenarios
> and uses 20mb on heap vs. multiple GB.