[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759820#comment-17759820 ]
Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------

[^18773.patch]

I took your idea above and added a preserveOrder method to MergeIterator, which the CompactionIterator implementation disables when there is no index.
{code:java}
INFO [CompactionExecutor:2] 2023-08-28 22:19:37,162 CompactionTask.java:239 - Read=53.93% 7.03 MiB/s, Write=20.47% 7.31 MiB/s
INFO [CompactionExecutor:2] 2023-08-28 22:20:37,162 CompactionTask.java:239 - Read=54.94% 6.97 MiB/s, Write=20.42% 7.24 MiB/s
INFO [CompactionExecutor:2] 2023-08-28 22:21:37,162 CompactionTask.java:239 - Read=53.69% 6.82 MiB/s, Write=22.33% 7.08 MiB/s
{code}
This gives basically the same results as my proof of concept.

[~blambov] what do you think about using background threads in compactions (to decouple read and write)? That change also gives a noticeable increase (40%), to:
{noformat}
INFO [CompactionExecutor:2] 2023-08-28 21:08:08,463 CompactionTask.java:266 - Read=37.27% 9.63 MiB/s, Write=28.22% 10 MiB/s
INFO [CompactionExecutor:2] 2023-08-28 21:09:08,463 CompactionTask.java:266 - Read=37.93% 9.65 MiB/s, Write=27.87% 10.02 MiB/s
{noformat}
It copies the rows into memory to pass them across to the writer, so the reader can advance its file positions. E.g.
{code:java}
// Drain the row iterator into memory so the rows can be handed to the writer
ArrayList<Unfiltered> rows = new ArrayList<>();
while (rowIterator.hasNext()) {
    rows.add(rowIterator.next());
}
{code}
So there is a tradeoff: the extra throughput comes at the cost of buffering rows in memory.

> Compactions are slow
> --------------------
>
>                 Key: CASSANDRA-18773
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18773
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: 18773.patch, compact-poc.patch, flamegraph.png, stress.yaml
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have noticed that compactions involving a lot of sstables are very slow (for example major compactions).
> I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4 MB each.
>
> I added code to track the wall clock time of various parts of the code. One problematic part is the ManyToOne constructor. Tracing through the code, a ManyToOne is created over all the sstable iterators for every partition. In my local test I get a measly 60 KB/s read speed, bottlenecked on a single CPU core (since this code is single threaded), with 85% of the wall clock time spent in the ManyToOne constructor.
>
> As another data point showing that the merge iterator is the slow part: cfstats from [https://github.com/instaclustr/cassandra-sstable-tools/], which reads all the sstables but does no merging, gets a 26 MB/s read speed.
>
> Tracking back from the ManyToOne call I see this in UnfilteredPartitionIterators::merge
> {code:java}
>             for (int i = 0; i < toMerge.size(); i++)
>             {
>                 if (toMerge.get(i) == null)
>                 {
>                     if (null == empty)
>                         empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
>                     toMerge.set(i, empty);
>                 }
>             }
> {code}
> I am not sure what the purpose of creating these empty iterators is. But on a whim I removed all of them before passing the list to ManyToOne, and then all the wall clock time shifted to CompactionIterator::hasNext() and read speed increased to 1.5 MB/s.
>
> So it seems there are further bottlenecks in this code path, but the first is this ManyToOne and having to build it for every partition read.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
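
For reference, the experiment described in the issue (dropping absent sources before merging, rather than substituting an empty iterator for each one) can be sketched roughly as follows. This is only a minimal illustration over plain integer iterators: SparseMergeSketch, Source and merge are hypothetical names, and the code does not reproduce Cassandra's MergeIterator or ManyToOne.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for illustration only; not Cassandra's MergeIterator.
public class SparseMergeSketch {

    // One live source: its current head element plus the rest of its iterator.
    private static final class Source implements Comparable<Source> {
        int head;
        final Iterator<Integer> rest;

        Source(Iterator<Integer> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Source other) {
            return Integer.compare(head, other.head);
        }
    }

    // Merges sorted iterators. A null entry means "this source has nothing
    // for the current partition" (an assumption made for this sketch).
    public static List<Integer> merge(List<Iterator<Integer>> sources) {
        PriorityQueue<Source> heap = new PriorityQueue<>();
        for (Iterator<Integer> it : sources) {
            // Skip absent or exhausted sources instead of wrapping them in
            // empty iterators, so the heap only holds sources with data.
            if (it != null && it.hasNext())
                heap.add(new Source(it));
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Source s = heap.poll();
            out.add(s.head);
            // Re-insert the source only while it still has elements.
            if (s.rest.hasNext()) {
                s.head = s.rest.next();
                heap.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> sources = new ArrayList<>();
        sources.add(Arrays.asList(1, 4, 7).iterator());
        sources.add(null); // source without data for this partition
        sources.add(Arrays.asList(2, 3, 9).iterator());
        sources.add(null);
        System.out.println(merge(sources)); // prints [1, 2, 3, 4, 7, 9]
    }
}
```

With this shape the setup cost per merge scales with the number of sources that actually hold data, not with the total number of sources. Filtering does shift the position of each remaining source in the list, so it only applies where the consumer does not depend on those positions.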