[ https://issues.apache.org/jira/browse/CASSANDRA-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218069#comment-14218069 ]

Branimir Lambov commented on CASSANDRA-7075:
--------------------------------------------

First draft of the multi-volume commit log can be found 
[here|https://github.com/blambov/cassandra/compare/7075-commitlog-volumes-2]. 
This is still a work in progress, but while I'm looking at ways to properly 
test everything, I'd be interested in some opinions on where to take this next.

To be able to spread the load between drives, the new implementation switches 
'volumes' on every sync request. Each volume has its own writing thread (which 
in the compressed case will also be doing the compression); the segment 
management thread, which handles creating and recycling segments, remains 
shared for now. Each volume writes in its own CommitLogSegment, so in effect we 
may write some mutations in one segment, switch to the segment in the other 
drive, then switch back to writing in the first, which means that the order of 
mutations is no longer determined solely by the segment ID. To deal with this I
exposed the concept of a 'section', which existed before as the set of 
mutations between two sync markers, and gave the section an ID which now 
replaces the segment ID in ReplayPositions. Every time we start writing to a 
volume, a new section with a fresh ID is created. Every time we switch volumes, 
a write for the old section is scheduled and either the volume is put back at 
the end of a queue of ready-to-use volumes (if the segment is not exhausted or 
there is an available reserve segment) or the management thread is woken to 
prepare a new segment and put the volume back in the queue when one is ready.
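
To make the rotation concrete, here is a minimal sketch of the step described 
above; Volume, Segment and the helper methods are illustrative names I'm using 
for the example, not necessarily what the branch calls them:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

class VolumeRotationSketch
{
    static final class Segment
    {
        boolean isExhausted() { return false; }   // placeholder for real accounting
    }

    static final class Volume
    {
        Segment segment = new Segment();   // segment this volume currently writes to
        long currentSectionId;             // replaces the segment ID in ReplayPositions
    }

    private final Queue<Volume> queuedVolumes = new ArrayDeque<>();
    private final AtomicLong nextSectionId = new AtomicLong();

    // Called on every sync request (and when a segment fills up): schedule the
    // old section's write, requeue or refill the volume, and open a fresh
    // section on the next ready volume.
    synchronized Volume advanceVolume(Volume current)
    {
        scheduleSectionWrite(current);          // hand the old section to its writer thread
        if (!current.segment.isExhausted())
            queuedVolumes.add(current);         // reusable as-is
        else
            requestNewSegment(current);         // management thread requeues it when ready

        Volume next = queuedVolumes.poll();     // null if all volumes are busy
        if (next != null)
            next.currentSectionId = nextSectionId.incrementAndGet();
        return next;
    }

    private void scheduleSectionWrite(Volume v) { /* per-volume writer thread */ }
    private void requestNewSegment(Volume v) { /* wake the segment management thread */ }
}
{code}

The synchronized block in this sketch is exactly the locking discussed further 
down.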

Because of the new ordering, commit log replay now has to be able to sort and 
operate on the level of sections (for new logs) as well as on the level of 
segments (for legacy logs). The machinery is refactored a little to permit 
this, and the new code is also used to select a non-conflicting section ID at 
startup.
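
A minimal sketch of the section-level ordering during replay, assuming a 
hypothetical SectionRef holder (the real code works on ReplayPositions and 
sync markers):

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReplayOrderingSketch
{
    static final class SectionRef
    {
        final long sectionId;   // for new logs this, not the segment ID, defines order
        final long segmentId;   // legacy logs are still ordered by segment ID
        final int offset;       // position of the section within its segment

        SectionRef(long sectionId, long segmentId, int offset)
        {
            this.sectionId = sectionId;
            this.segmentId = segmentId;
            this.offset = offset;
        }
    }

    // Collect sections from all segments on all volumes, then replay them in
    // section-ID order so mutations interleaved across segments apply correctly.
    static List<SectionRef> replayOrder(List<SectionRef> allSections)
    {
        List<SectionRef> ordered = new ArrayList<>(allSections);
        ordered.sort(Comparator.comparingLong(s -> s.sectionId));
        return ordered;
    }
}
{code}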

For full flexibility commit log volumes are configured separately from data 
volumes. If necessary, multiple volumes can be assigned to the same drive. With 
archiving it's not clear where archived logs should be restored to, so I added 
an option to specify that as well (defaulting to the first CL volume).
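
For illustration, the configuration surface could look roughly like this; the 
field names below are placeholders, not necessarily what the patch adds to 
Config/cassandra.yaml:

{code:java}
public class ConfigSketch
{
    // One or more commit log volumes; several entries may point at the same
    // drive if the user wants that.
    public String[] commitlog_directories =
            { "/mnt/disk1/commitlog", "/mnt/disk2/commitlog" };

    // Where archived logs are restored; defaults to the first CL volume.
    public String commitlog_restore_directory = "/mnt/disk1/commitlog";
}
{code}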

The current code has more locking than I'd like, most importantly in 
CLSM.advanceVolume(), which is called every time a disk synchronization is 
requested (also when a segment is full, but that has much lower frequency). 
There is a noticeable impact on performance; I need more performance testing in 
various configurations to quantify it. I can see three ways to continue from 
here:

# Leave the locking as it is, which permits flexibility in the ordering of 
volumes in the queue. This can be exploited by making queuedVolumes a priority 
queue ordered, e.g., by expected sync finish time. Such an ordering would 
handle heterogeneous setups (e.g. SSDs + HDDs, and, more importantly, an 
uneven distribution of requests on the drives from other parts of the code) 
very well. I think this option will result in the least complex code and the 
most flexible solution; a sketch of this queue follows the discussion below.
# Disallow reordering of volumes in the queue, which lets section IDs be 
assigned on queue entry rather than exit; with a little more work, switching 
to a new section from the queue can be made a single compare-and-swap (see the 
sketch just after this list). In this option the load necessarily has to be 
spread evenly between the specified CL volumes (though not necessarily between 
the drives, as a user may still give multiple directories on the same drive). 
With a single CL volume, and possibly in homogeneous scenarios, this option 
should give the best performance.
# As above, but put sections in the queue only when the previous sync for the 
volume has completed. This option can use the drives' performance most 
efficiently, but it needs another queuing layer to properly handle situations 
where all drives are busy and mutations are still incoming.
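
A sketch of the single atomic switch from option 2, assuming FIFO ordering so 
each section's ID was already assigned when its volume entered the queue 
(class and method names are mine):

{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

class CasSectionSwitchSketch
{
    static final class Section
    {
        final long id;          // assigned on queue entry, not exit
        Section(long id) { this.id = id; }
    }

    private final AtomicReference<Section> currentSection =
            new AtomicReference<>(new Section(0));
    private final ConcurrentLinkedQueue<Section> readySections =
            new ConcurrentLinkedQueue<>();

    // Switch to the next pre-assigned section with one atomic swap; the
    // previous section is returned so its sync can be scheduled.
    Section advance()
    {
        Section next = readySections.poll();
        if (next == null)
            return null;        // all volumes busy: caller must wait
        return currentSection.getAndSet(next);
    }
}
{code}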

I'm leaning towards (1) for the flexibility, but it may cause a performance 
regression in the single-volume case. Is it worth investing the time to try 
out two or all three options?
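
For reference, option 1's priority queue could be keyed roughly like this; the 
expected-finish-time estimate (e.g. an EWMA of recent sync durations) is my 
assumption, not something the branch implements yet:

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

class PriorityVolumeQueueSketch
{
    static final class Volume
    {
        long expectedSyncFinishNanos;   // last sync start + EWMA of sync durations
    }

    private final PriorityQueue<Volume> queuedVolumes =
            new PriorityQueue<>(Comparator.comparingLong(v -> v.expectedSyncFinishNanos));

    // Pick the volume expected to be free soonest; with identical drives this
    // degenerates to round-robin, with SSD + HDD it naturally favours the SSD.
    Volume nextVolume()
    {
        return queuedVolumes.poll();
    }

    void requeue(Volume v, long syncStartNanos, long ewmaSyncDurationNanos)
    {
        v.expectedSyncFinishNanos = syncStartNanos + ewmaSyncDurationNanos;
        queuedVolumes.add(v);
    }
}
{code}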

> Add the ability to automatically distribute your commitlogs across all data 
> volumes
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7075
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7075
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Branimir Lambov
>            Priority: Minor
>              Labels: performance
>             Fix For: 3.0
>
>
> Given the prevalence of SSDs (no need to separate commitlog and data) and 
> improved JBOD support, along with CASSANDRA-3578, it seems like we should 
> have an option to have one commitlog per data volume, to even the load. I've 
> been seeing more and more cases where there isn't an obvious "extra" volume 
> to put the commitlog on, and sticking it on only one of the JBODded SSD 
> volumes leads to IO imbalance.


