[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759864#comment-17759864 ]
Branimir Lambov commented on CASSANDRA-18773: --------------------------------------------- {quote}This does copying of the rows into memory to pass across to the writer {quote} Partitions can have millions of rows, we cannot afford to copy something this big to memory. The current materialization point is the row and even that is a problem for some use cases. Processing the read in another thread adds a lot of synchronization overhead, changes the balance between request serving and compaction, and in a busy node can actually result in compaction failing to make progress. We have instead been focusing on making compaction more parallelizable (see e.g. UCS/CEP-26/CASSANDRA-18397). UCS can already parallelize major compactions when there is no overlap, and we can do even better if we provide the split points to the compaction process in advance (see CASSANDRA-18802). > Compactions are slow > -------------------- > > Key: CASSANDRA-18773 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18773 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction > Reporter: Cameron Zemek > Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: 18773.patch, compact-poc.patch, flamegraph.png, > stress.yaml > > Time Spent: 10m > Remaining Estimate: 0h > > I have noticed that compactions involving a lot of sstables are very slow > (for example major compactions). I have attached a cassandra stress profile > that can generate such a dataset under ccm. In my local test I have 2567 > sstables at 4Mb each. > I added code to track wall clock time of various parts of the code. One > problematic part is ManyToOne constructor. Tracing through the code for every > partition creating a ManyToOne for all the sstable iterators for each > partition. In my local test get a measy 60Kb/sec read speed, and bottlenecked > on single core CPU (since this code is single threaded) with it spending 85% > of the wall clock time in ManyToOne constructor. > As another datapoint to show its the merge iterator part of the code using > the cfstats from [https://github.com/instaclustr/cassandra-sstable-tools/] > which reads all the sstables but does no merging gets 26Mb/sec read speed. > Tracking back from ManyToOne call I see this in > UnfilteredPartitionIterators::merge > {code:java} > for (int i = 0; i < toMerge.size(); i++) > { > if (toMerge.get(i) == null) > { > if (null == empty) > empty = EmptyIterators.unfilteredRow(metadata, > partitionKey, isReverseOrder); > toMerge.set(i, empty); > } > } > {code} > Not sure what purpose of creating these empty rows are. But on a whim I > removed all these empty iterators before passing to ManyToOne and then all > the wall clock time shifted to CompactionIterator::hasNext() and read speed > increased to 1.5Mb/s. > So there are further bottlenecks in this code path it seems, but the first is > this ManyToOne and having to build it for every partition read. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org