[ https://issues.apache.org/jira/browse/CASSANDRA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079536#comment-13079536 ]
Jonathan Ellis commented on CASSANDRA-2901:
-------------------------------------------
I added some debug logging that shows that it's actually including extra
columns in the first pass.
[pass 1]
{noformat}
...
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,056 LazilyCompactedRow.java
(line 225) added 16481 to serializedSize for 2ab319d0beba11e00000fe8ebeead9cb
[the next are bogus]
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,056 LazilyCompactedRow.java
(line 225) added 17075 to serializedSize for 2acf0640beba11e00000fe8ebeead9cb
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,056 LazilyCompactedRow.java
(line 225) added 17585 to serializedSize for 2af15b50beba11e00000fe8ebeead9cb
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,056 LazilyCompactedRow.java
(line 225) added 17596 to serializedSize for 2af8fc70beba11e00000fe8ebeead9cb
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,057 LazilyCompactedRow.java
(line 225) added 17493 to serializedSize for 2b0335a0beba11e00000fe8ebeead9cb
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,057 LazilyCompactedRow.java
(line 225) added 17493 to serializedSize for 2b200c70beba11e00000fe8ebeead9cb
{noformat}
[pass 2]
{noformat}
...
DEBUG [CompactionExecutor:1] 2011-08-04 11:52:42,088 LazilyCompactedRow.java
(line 225) added 16481 to serializedSize for 2ab319d0beba11e00000fe8ebeead9cb
{noformat}
Baffling.
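The logs above show the first pass adding sizes for columns that the second pass never visits. As a rough illustration of why that matters, here is a minimal sketch of a guard for a serializer that makes one pass to total the size and a second pass to write. Everything in it is an assumption made for illustration: `TwoPassRowSketch`, `Column`, `serialize`, and the byte layout are hypothetical and are not the `LazilyCompactedRow` code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.function.Supplier;

class TwoPassRowSketch {
    /** Hypothetical column: a name and a value, both stored as raw bytes. */
    static final class Column {
        final byte[] name;
        final byte[] value;
        Column(String name, String value) {
            this.name = name.getBytes(StandardCharsets.UTF_8);
            this.value = value.getBytes(StandardCharsets.UTF_8);
        }
        // Made-up layout: 2-byte name length, name, 4-byte value length, value.
        int serializedSize() {
            return 2 + name.length + 4 + value.length;
        }
    }

    /**
     * Serializes a row in two passes over the same column source.
     * Fails loudly if the second pass writes a different number of bytes
     * than the first pass counted.
     */
    static byte[] serialize(Supplier<Iterator<Column>> columns) {
        // Pass 1: total the serialized size without writing anything.
        long expected = 0;
        for (Iterator<Column> it = columns.get(); it.hasNext(); ) {
            expected += it.next().serializedSize();
        }
        // Pass 2: write the size header, then the columns themselves.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(expected);
            for (Iterator<Column> it = columns.get(); it.hasNext(); ) {
                Column c = it.next();
                out.writeShort(c.name.length);
                out.write(c.name);
                out.writeInt(c.value.length);
                out.write(c.value);
            }
            long written = out.size() - 8; // exclude the 8-byte size header
            if (written != expected) {
                throw new IllegalStateException(
                    "first pass counted " + expected + " bytes but second pass wrote " + written);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

A source whose two iterations disagree (extra columns in the first pass, as in the logs) trips the check instead of producing a row whose declared size does not match its contents.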
> Allow taking advantage of multiple cores while compacting a single CF
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-2901
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2901
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Jonathan Ellis
> Priority: Minor
> Fix For: 0.8.4
>
> Attachments:
> 0001-fix-tracker-getting-out-of-sync-with-underlying-data-s.txt,
> 0002-parallel-compaction.txt, 0003-Fix-LCR.patch
>
>
> Moved from CASSANDRA-1876:
> There are five stages: read, deserialize, merge, serialize, and write. We
> probably want to continue doing read+deserialize and serialize+write
> together, or you waste a lot of time copying to/from buffers.
> So, what I would suggest is: one thread per input sstable doing read +
> deserialize (a row at a time). A thread pool (one per core?) merging
> corresponding rows from each input sstable. One thread doing serialize +
> writing the output (this has to wait for the merge threads to complete
> in-order, obviously). This should take us from being CPU bound on SSDs (since
> only one core is compacting) to being I/O bound.
> This will require roughly 2x the memory, to allow the reader threads to work
> ahead of the merge stage. (I.e. for each input sstable you will have up to
> one row in a queue waiting to be merged, and the reader thread working on the
> next.) Seems quite reasonable on that front. You'll also want a small queue
> size for the serialize-merged-rows executor.
> Multithreaded compaction should be either on or off. It doesn't make sense to
> try to do things halfway (by doing the reads with a threadpool whose size you
> can grow/shrink, for instance): we still have compaction threads tuned to low
> priority, by default, so the impact on the rest of the system won't be very
> different. Nor do we expect to have so many input sstables that we lose a lot
> in context switching between reader threads.
> IMO it's acceptable to punt completely on rows that are larger than memory,
> and fall back to the old non-parallel code there. I don't see any sane way to
> parallelize large-row compactions.
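The threading layout in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the attached patches: `ParallelCompactionSketch`, `Row`, `merge`, and `compact` are hypothetical names, in-memory sorted lists stand in for sstables, the queue sizes are arbitrary, and the merge simply lets later inputs win rather than reconciling columns by timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCompactionSketch {
    /** Hypothetical stand-in for a deserialized row: a key plus sorted columns. */
    static final class Row {
        final String key;
        final TreeMap<String, String> columns;
        Row(String key, Map<String, String> columns) {
            this.key = key;
            this.columns = new TreeMap<>(columns);
        }
    }

    /** End-of-input marker, compared by identity only. */
    static final Row END = new Row("", Collections.<String, String>emptyMap());

    /** Merges fragments of one row. Later inputs win on conflict (an assumption). */
    static Row merge(List<Row> fragments) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Row r : fragments) merged.putAll(r.columns);
        return new Row(fragments.get(0).key, merged);
    }

    /** Compacts inputs that are each sorted by key into one sorted output. */
    static List<Row> compact(List<List<Row>> inputs) {
        int n = inputs.size();
        ExecutorService readers = Executors.newFixedThreadPool(Math.max(1, n));
        ExecutorService mergers =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            // One reader per input; a queue of size 1 lets it work one row ahead.
            List<BlockingQueue<Row>> queues = new ArrayList<>();
            for (List<Row> input : inputs) {
                BlockingQueue<Row> q = new ArrayBlockingQueue<>(1);
                queues.add(q);
                readers.submit(() -> {
                    for (Row r : input) q.put(r);
                    q.put(END);
                    return null;
                });
            }
            // Small queue of merge results; the writer consumes them in submit order.
            BlockingQueue<Future<Row>> pending = new ArrayBlockingQueue<>(4);
            Future<List<Row>> written = writer.submit(() -> {
                List<Row> out = new ArrayList<>();
                for (Row r = pending.take().get(); r != END; r = pending.take().get()) {
                    out.add(r);
                }
                return out;
            });
            // Coordinator: group rows with the smallest key, hand each group to a merger.
            Row[] heads = new Row[n];
            for (int i = 0; i < n; i++) heads[i] = queues.get(i).take();
            while (true) {
                String min = null;
                for (Row h : heads) {
                    if (h != END && (min == null || h.key.compareTo(min) < 0)) min = h.key;
                }
                if (min == null) break; // every input is exhausted
                List<Row> fragments = new ArrayList<>();
                for (int i = 0; i < n; i++) {
                    if (heads[i] != END && heads[i].key.equals(min)) {
                        fragments.add(heads[i]);
                        heads[i] = queues.get(i).take();
                    }
                }
                Future<Row> merged = mergers.submit(() -> merge(fragments));
                pending.put(merged);
            }
            pending.put(CompletableFuture.completedFuture(END));
            return written.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            readers.shutdownNow();
            mergers.shutdownNow();
            writer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Row> first = Arrays.asList(
            new Row("a", Collections.singletonMap("c1", "1")),
            new Row("c", Collections.singletonMap("c1", "3")));
        List<Row> second = Arrays.asList(
            new Row("a", Collections.singletonMap("c2", "2")),
            new Row("b", Collections.singletonMap("c1", "9")));
        for (Row r : compact(Arrays.asList(first, second))) {
            System.out.println(r.key + " " + r.columns);
        }
    }
}
```

The writer takes futures in the order they were submitted, so output stays sorted even when merges finish out of order. The bounded queues are what cap the extra memory at roughly one buffered row per input plus a handful of merged rows.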