[ https://issues.apache.org/jira/browse/CASSANDRA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066737#comment-13066737 ]

Jonathan Ellis commented on CASSANDRA-2901:
-------------------------------------------

That's an interesting idea, but the more I think about it the less convinced I 
am that it's an easy win.

First of all, the premise that compaction is GC-intensive should be qualified: 
it can help cause more frequent young-gen collections, but almost none of what 
it allocates will ever be promoted to the old gen, which is what most people 
worry about.  Small rows are compacted quickly enough not to be promoted, and 
large rows are compacted column-at-a-time, so those columns also won't live 
long enough to be promoted.  (If you are seeing "medium size" rows get 
tenured, consider reducing in_memory_compaction_limit_in_mb.)

Second, it's harder than it looks to actually push compaction out to another 
process, because you have basically three choices:
- use Runtime.exec or ProcessBuilder
- use JNA and vfork
- run a separate, always-on "compaction daemon" and communicate with it over 
RMI or other IPC

The first of these is implemented using fork on Linux, which can cause spurious 
OOMs when running in an environment with overcommit disabled (which is 
generally accepted as best practice in a server environment). Overcommit aside, 
copying even just the page table for a largish heap is expensive: 
http://lwn.net/Articles/360509/ 
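
For concreteness, a minimal sketch of the first option (the worker jar name 
and its arguments are made up for illustration, not an existing tool):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ExternalCompactionRunner
    {
        // Hypothetical sketch: spawn a separate JVM to compact a set of sstables.
        // On Linux the JDK implements this with fork()+exec(), so with overcommit
        // disabled the fork can fail even though the child would exec immediately.
        public static int compactExternally(List<String> sstablePaths)
                throws IOException, InterruptedException
        {
            List<String> cmd = new ArrayList<String>(
                    Arrays.asList("java", "-Xmx256M", "-jar", "compaction-worker.jar"));
            cmd.addAll(sstablePaths);

            Process p = new ProcessBuilder(cmd)
                    .redirectErrorStream(true) // merge child stderr into stdout
                    .start();
            return p.waitFor();                // non-zero exit = compaction failed
        }
    }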

vfork avoids copying the parent process's page table, but it's obviously not 
completely portable, so we'd have to keep in-process compaction around as a 
fallback option.
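
In practice a JNA route would more likely bind posix_spawn (which glibc can 
implement with vfork/CLONE_VFORK, depending on version and spawn attributes) 
rather than call vfork directly from the JVM. A rough, hypothetical sketch of 
such a binding, purely for illustration:

    import com.sun.jna.Library;
    import com.sun.jna.Native;
    import com.sun.jna.Pointer;
    import com.sun.jna.StringArray;
    import com.sun.jna.ptr.IntByReference;

    // Illustrative only: a JNA mapping of posix_spawn.
    public interface SpawnLibrary extends Library
    {
        SpawnLibrary INSTANCE = (SpawnLibrary) Native.loadLibrary("c", SpawnLibrary.class);

        // int posix_spawn(pid_t *pid, const char *path,
        //                 const posix_spawn_file_actions_t *file_actions,
        //                 const posix_spawnattr_t *attrp,
        //                 char *const argv[], char *const envp[]);
        int posix_spawn(IntByReference pid, String path, Pointer fileActions,
                        Pointer attr, StringArray argv, StringArray envp);

        // Example call (argv[0] conventionally repeats the executable name):
        //   IntByReference pid = new IntByReference();
        //   int rc = SpawnLibrary.INSTANCE.posix_spawn(pid, "/usr/bin/java", null, null,
        //       new StringArray(new String[]{ "java", "-jar", "compaction-worker.jar" }), null);
    }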

Neither of these makes it easy to communicate back to the parent Cassandra 
process which cached rows should be invalidated (CASSANDRA-2305). This may be 
something we can live with (we did for years), but it's a regression 
nevertheless.

The compaction daemon approach avoids the above problems but adds substantial 
implementation complexity.
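
Roughly, the daemon would expose something like the following remote interface 
(the names are hypothetical), plus all the startup, monitoring, upgrade, and 
failure-handling machinery around it:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // Hypothetical: an always-on compaction daemon reached over RMI.  Returning
    // the merged keys would also give the parent process a way to invalidate its
    // row cache (the CASSANDRA-2305 problem above).
    public interface CompactionDaemon extends Remote
    {
        List<byte[]> compact(String keyspace, String columnFamily, List<String> sstablePaths)
                throws RemoteException;
    }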

tl;dr: you're welcome to experiment with it but I don't think it's at all clear 
yet that the cost/benefit is there.

> Allow taking advantage of multiple cores while compacting a single CF
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-2901
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2901
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Moved from CASSANDRA-1876:
> There are five stages: read, deserialize, merge, serialize, and write. We 
> probably want to continue doing read+deserialize and serialize+write 
> together, or you waste a lot of time copying to/from buffers.
> So, what I would suggest is: one thread per input sstable doing read + 
> deserialize (a row at a time). One thread merging corresponding rows from 
> each input sstable. One thread doing serialize + writing the output. This 
> should give us between 2x and 3x speedup (depending on how much doing the 
> merge on a separate thread from the write saves us).
> This will require roughly 2x the memory, to allow the reader threads to work 
> ahead of the merge stage. (I.e. for each input sstable you will have up to 
> one row in a queue waiting to be merged, and the reader thread working on the 
> next.) Seems quite reasonable on that front.
> Multithreaded compaction should be either on or off. It doesn't make sense to 
> try to do things halfway (by doing the reads with a
> threadpool whose size you can grow/shrink, for instance): we still have 
> compaction threads tuned to low priority, by default, so the impact on the 
> rest of the system won't be very different. Nor do we expect to have so many 
> input sstables that we lose a lot in context switching between reader 
> threads. (If this is a concern, we already have a tunable to limit the number 
> of sstables merged at a time in a single CF.)
> IMO it's acceptable to punt completely on rows that are larger than memory, 
> and fall back to the old non-parallel code there. I don't see any sane way to 
> parallelize large-row compactions.
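
For what it's worth, a minimal sketch of the pipeline described above.  The 
Row/RowSource/RowSink/RowMerger types are placeholders, not the real sstable 
classes, and it assumes the inputs are already aligned by key rather than 
doing a proper k-way merge:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelinedCompactionSketch
    {
        // Placeholder stand-ins for the real deserialized-row and sstable I/O types.
        interface Row {}
        interface RowSource { Row nextRow(); }                 // null once the sstable is exhausted
        interface RowSink   { void append(Row merged); }
        interface RowMerger { Row merge(List<Row> versions); } // combine the versions of one row

        static final Row EOF = new Row() {};                   // end-of-sstable marker

        // Simplification: assumes every input yields rows for the same keys in the
        // same order, so the i'th rows all merge together; the real merge stage
        // would do a k-way merge on decorated keys instead.
        public static void compact(List<RowSource> inputs, RowSink output, RowMerger merger)
                throws InterruptedException
        {
            // Stage 1: one reader thread per input sstable, each feeding a bounded
            // queue.  Capacity 1 means at most one deserialized row waits per input
            // while its reader works on the next: the ~2x memory cost noted above.
            final List<BlockingQueue<Row>> queues = new ArrayList<BlockingQueue<Row>>();
            for (final RowSource source : inputs)
            {
                final BlockingQueue<Row> queue = new ArrayBlockingQueue<Row>(1);
                queues.add(queue);
                new Thread("CompactionReader")
                {
                    public void run()
                    {
                        try
                        {
                            Row row;
                            while ((row = source.nextRow()) != null)
                                queue.put(row);                // blocks until the merge stage catches up
                            queue.put(EOF);
                        }
                        catch (InterruptedException e)
                        {
                            Thread.currentThread().interrupt();
                        }
                    }
                }.start();
            }

            // Stage 2: merge corresponding rows on a single thread.  (Stage 3,
            // serialize + write, is folded in here for brevity; the design above
            // would put it on its own thread behind one more bounded queue.)
            int remaining = queues.size();
            boolean[] exhausted = new boolean[queues.size()];
            while (remaining > 0)
            {
                List<Row> versions = new ArrayList<Row>();
                for (int i = 0; i < queues.size(); i++)
                {
                    if (exhausted[i])
                        continue;
                    Row row = queues.get(i).take();
                    if (row == EOF) { exhausted[i] = true; remaining--; }
                    else            versions.add(row);
                }
                if (!versions.isEmpty())
                    output.append(merger.merge(versions));
            }
        }
    }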

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
