[jira] [Commented] (CASSANDRA-2901) Allow taking advantage of multiple cores while compacting a single CF

Jonathan Ellis (JIRA) Wed, 27 Jul 2011 20:24:04 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072156#comment-13072156 ]


Jonathan Ellis commented on CASSANDRA-2901:
-------------------------------------------

Thanks, Yewei!

Comments:

- I think we can simplify the "wait for row to be merged" logic by noting that 
CompactionTask is itself single-threaded.  So I'd have PCI.next return an 
AbstractCompactedRow subclass (FutureACR?) that knows how to wait for the merge 
to finish.  Then we don't need any special logic in PCI itself: we can just 
pull rows-being-merged off in order and leave it to CompactionTask to block 
until each merge finishes.
- "ReaderThread" multithreads the merges, but it looks like reading the source 
sstables is still single-threaded (per merge).  Somehow we need to get the 
PrecompactedRow row.getColumnFamilyWithColumns call into its own thread.  Again, 
I like the pattern of an SSTII wrapper that uses a Future to pull the data from 
a task on a (per-source-sstable) executor here, but I'm sure there are other 
options.  (Be careful to let LazilyCR tasks stay single-threaded, though.)
- I don't see the reason to have two different sentinel conditions; why not 
just use NO_ROW in both cases?
- Note on style: better to name the things you run on executors "Task" (e.g. 
MergerTask) than "Thread" because "MergerThread" implies that it is a Thread 
subclass.
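
To make the first point concrete, here is a minimal sketch of the "row that knows how to wait for its own merge" idea. It is not taken from the patch: Row, FutureRow, parallelMerge and drain are hypothetical stand-ins for AbstractCompactedRow, the proposed FutureACR, PCI and CompactionTask, and the "merge" is just string concatenation. It also submits every merge up front, which real code would need to bound.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these types are actual Cassandra classes.
class FutureRowSketch {
    /** Stand-in for a compacted row; here it only exposes its merged contents. */
    interface Row {
        String get();
    }

    /** The "FutureACR" idea: a row that blocks only when its contents are requested. */
    static final class FutureRow implements Row {
        private final Future<String> merged;

        FutureRow(Future<String> merged) {
            this.merged = merged;
        }

        @Override
        public String get() {
            try {
                return merged.get(); // waits here until the merge task has finished
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
    }

    /** Iterator side: submit one merge task per key and hand rows back in submission order. */
    static Iterator<Row> parallelMerge(List<List<String>> versionsPerKey, ExecutorService mergers) {
        List<Row> rows = new ArrayList<>();
        for (List<String> versions : versionsPerKey) {
            // Named "Task", not "Thread": it is something run on an executor.
            Callable<String> mergeTask = () -> String.join("+", versions);
            rows.add(new FutureRow(mergers.submit(mergeTask)));
        }
        return rows.iterator();
    }

    /** Single-threaded consumer: all blocking happens here, one row at a time, in order. */
    static List<String> drain(Iterator<Row> rows) {
        List<String> out = new ArrayList<>();
        while (rows.hasNext()) {
            out.add(rows.next().get());
        }
        return out;
    }
}
```

Because the consumer is single-threaded and pulls rows in order, the iterator needs no coordination logic of its own; output order is fixed by submission order regardless of which merge finishes first.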

> Allow taking advantage of multiple cores while compacting a single CF
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-2901
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2901
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>             Fix For: 0.8.3
>
>         Attachments: 2901.patch
>
>
> Moved from CASSANDRA-1876:
> There are five stages: read, deserialize, merge, serialize, and write. We 
> probably want to continue doing read+deserialize and serialize+write 
> together, or you waste a lot copying to/from buffers.
> So, what I would suggest is: one thread per input sstable doing read + 
> deserialize (a row at a time). A thread pool (one per core?) merging 
> corresponding rows from each input sstable. One thread doing serialize + 
> writing the output (this has to wait for the merge threads to complete 
> in-order, obviously). This should take us from being CPU bound on SSDs (since 
> only one core is compacting) to being I/O bound.
> This will require roughly 2x the memory, to allow the reader threads to work 
> ahead of the merge stage. (I.e. for each input sstable you will have up to 
> one row in a queue waiting to be merged, and the reader thread working on the 
> next.) Seems quite reasonable on that front.  You'll also want a small queue 
> size for the serialize-merged-rows executor.
> Multithreaded compaction should be either on or off. It doesn't make sense to 
> try to do things halfway (by doing the reads with a threadpool whose size you 
> can grow/shrink, for instance): we still have compaction threads tuned to low 
> priority, by default, so the impact on the rest of the system won't be very 
> different. Nor do we expect to have so many input sstables that we lose a lot 
> in context switching between reader threads.
> IMO it's acceptable to punt completely on rows that are larger than memory, 
> and fall back to the old non-parallel code there. I don't see any sane way to 
> parallelize large-row compactions.
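
For illustration only, the pipeline described in the issue could be sketched roughly as follows. Everything here is an assumption made for the example, not Cassandra code: PipelineSketch, Row and compact are hypothetical names, each "sstable" is an in-memory list with unique keys in ascending order, and merging is string concatenation. A queue of capacity one per source gives the "one row waiting, one being read" behaviour, and a single NO_ROW sentinel marks the end of every source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the read -> merge -> write pipeline; not Cassandra code.
class PipelineSketch {
    static final class Row {
        final int key;
        final String value;

        Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Single end-of-stream sentinel, compared by identity. */
    static final Row NO_ROW = new Row(Integer.MAX_VALUE, "");

    static List<String> compact(List<List<Row>> sources, int mergeThreads) {
        ExecutorService readers = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        ExecutorService mergers = Executors.newFixedThreadPool(mergeThreads);
        try {
            // One reader task per source; capacity 1 lets each reader work one row ahead.
            List<BlockingQueue<Row>> queues = new ArrayList<>();
            for (List<Row> source : sources) {
                BlockingQueue<Row> queue = new ArrayBlockingQueue<>(1);
                queues.add(queue);
                readers.submit(() -> {
                    for (Row row : source) queue.put(row);
                    queue.put(NO_ROW);
                    return null;
                });
            }
            Row[] heads = new Row[queues.size()];
            for (int i = 0; i < heads.length; i++) heads[i] = queues.get(i).take();

            // Group rows with the smallest key and hand each group to the merge pool.
            List<Future<String>> merged = new ArrayList<>();
            while (true) {
                Row min = NO_ROW;
                for (Row head : heads) {
                    if (head != NO_ROW && (min == NO_ROW || head.key < min.key)) min = head;
                }
                if (min == NO_ROW) break; // every source is exhausted
                List<String> versions = new ArrayList<>();
                for (int i = 0; i < heads.length; i++) {
                    if (heads[i] != NO_ROW && heads[i].key == min.key) {
                        versions.add(heads[i].value);
                        heads[i] = queues.get(i).take();
                    }
                }
                int key = min.key;
                merged.add(mergers.submit(() -> key + "=" + String.join("+", versions)));
            }

            // "Write" stage: wait for merges in key order, whatever order they finish in.
            List<String> output = new ArrayList<>();
            for (Future<String> row : merged) output.add(row.get());
            return output;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            readers.shutdownNow();
            mergers.shutdownNow();
        }
    }
}
```

For brevity the write stage drains the futures only after all merges have been submitted; a real implementation would run it concurrently behind a small bounded queue, as the description suggests.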

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
