[
https://issues.apache.org/jira/browse/CASSANDRA-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218069#comment-14218069
]
Branimir Lambov commented on CASSANDRA-7075:
--------------------------------------------
A first draft of the multi-volume commit log can be found
[here|https://github.com/blambov/cassandra/compare/7075-commitlog-volumes-2].
This is still a work in progress, but while I'm looking into ways to properly
test everything, I'd be interested in opinions on where to take this next.
To spread the load between drives, the new implementation switches 'volumes'
on every sync request. Each volume has its own writing thread (which, in the
compressed case, also does the compression); the segment management thread,
which handles creating and recycling segments, remains shared for now.

Each volume writes to its own CommitLogSegment, so in effect we may write some
mutations in one segment, switch to the segment on another drive, then switch
back to writing in the first, which means the order of mutations is no longer
determined primarily by the segment ID. To deal with this I exposed the
concept of a 'section', which existed before as the set of mutations between
two sync markers, and gave sections an ID that now replaces the segment ID in
ReplayPositions. Every time we start writing to a volume, a new section with a
fresh ID is created. Every time we switch volumes, a write for the old section
is scheduled, and the volume is either put back at the end of a queue of
ready-to-use volumes (if its segment is not exhausted or a reserve segment is
available) or handed to the segment management thread, which is woken to
prepare a new segment and re-queue the volume when one is ready.
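To make the handoff concrete, here is a minimal sketch of the rotation on each
sync request. All names (Volume, Section, advanceVolume, the helper stubs) are
hypothetical illustrations of the mechanism described above, not the actual
branch code:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the volume rotation described above.
class VolumeRotationSketch
{
    static class Section
    {
        final long id;
        Section(long id) { this.id = id; }
    }

    static class Volume
    {
        Section current;          // section currently receiving mutations
        boolean segmentExhausted; // set when the backing segment fills up
    }

    private final AtomicLong nextSectionId = new AtomicLong();
    private final Queue<Volume> queuedVolumes = new ArrayDeque<>();

    // Called on every sync request (and, less often, when a segment fills up).
    synchronized Volume advanceVolume(Volume active)
    {
        scheduleSectionWrite(active.current);  // volume's writer thread syncs (and compresses)
        if (!active.segmentExhausted || reserveSegmentAvailable())
            queuedVolumes.add(active);         // back of the ready-to-use queue
        else
            wakeManagementThread(active);      // re-queued once a new segment is prepared

        Volume next = queuedVolumes.poll();    // sketch assumes a volume is always queued
        // A fresh section ID every time we start writing to a volume,
        // so replay can order mutations across volumes.
        next.current = new Section(nextSectionId.incrementAndGet());
        return next;
    }

    private void scheduleSectionWrite(Section s) {}
    private boolean reserveSegmentAvailable() { return true; }
    private void wakeManagementThread(Volume v) {}
}
{code}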
Because of the new ordering, commit log replay now has to be able to sort and
operate at the level of sections (for new logs) as well as at the level of
segments (for legacy logs). The replay machinery is refactored a little to
permit this, and the new code is also used to select a non-conflicting section
ID at startup.
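For illustration, section-level replay ordering could look roughly like the
following sketch (LogSection and its fields are assumed names, not the actual
replay machinery):

{code:java}
import java.util.Comparator;
import java.util.List;

// Illustrative only: replay collects the sections found between sync markers
// across all segments and applies them in section-ID order, since mutation
// order no longer follows segment IDs alone.
class ReplayOrderingSketch
{
    static class LogSection
    {
        final long sectionId;     // replaces the segment ID in ReplayPosition
        final String segmentFile; // segment the section was read from
        final int offset;         // start of the section within the segment

        LogSection(long sectionId, String segmentFile, int offset)
        {
            this.sectionId = sectionId;
            this.segmentFile = segmentFile;
            this.offset = offset;
        }
    }

    static void replay(List<LogSection> sections)
    {
        sections.sort(Comparator.comparingLong(s -> s.sectionId));
        for (LogSection s : sections)
            replaySection(s); // apply the section's mutations in order
    }

    static void replaySection(LogSection s) { /* read and apply mutations */ }
}
{code}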
For full flexibility, commit log volumes are configured separately from data
volumes. If necessary, multiple volumes can be assigned to the same drive.
With archiving it is not clear where archived logs should be restored, so I
added an option to specify that as well (defaulting to the first CL volume).
The current code has more locking than I'd like, most importantly in
CLSM.advanceVolume(), which is called every time a disk synchronization is
requested (and also when a segment is full, though that happens much less
frequently). There is a noticeable impact on performance; I need more
performance testing in various configurations to quantify it. I can see three
ways to continue from here:
# Leave the locking as it is, which permits flexibility in the ordering of
volumes in the queue. This can be exploited by making queuedVolumes a priority
queue ordered, e.g., by expected sync finish time (see the first sketch after
this list). Such an ordering can handle heterogeneous situations (e.g. SSDs +
HDDs; more importantly, an uneven distribution of requests from other parts of
the code across the drives) very well. I think this option will result in the
least complex code and the most flexible solution.
# Do not permit reordering of volumes in the queue, which lets section IDs be
assigned on queue entry rather than exit; with a little more work, switching
to a new section from the queue can be made a single compare-and-swap (see the
second sketch below). In this option the load necessarily has to be spread
evenly between the specified CL volumes (though not necessarily between the
drives, as a user may still give multiple directories on the same drive). With
a single CL volume, and possibly in homogeneous scenarios, this option should
result in the best performance.
# As above, but put sections in the queue only when the previous sync for the
volume has completed. This option can use the drives' performance most
efficiently, but it needs another queuing layer to deal properly with
situations where all drives are busy and mutations are still incoming.
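For option (1), here is a rough sketch of the priority-queue idea; again, all
names are hypothetical, not the branch code:

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of option 1: queuedVolumes becomes a priority queue ordered by each
// volume's expected sync finish time, so a fast SSD volume is picked up again
// sooner than a busy HDD volume.
class PriorityVolumeQueueSketch
{
    static class Volume
    {
        long expectedSyncFinishNanos; // estimated from recent sync durations
    }

    private final PriorityQueue<Volume> queuedVolumes =
        new PriorityQueue<>(Comparator.comparingLong(v -> v.expectedSyncFinishNanos));

    void requeue(Volume v, long syncStartedNanos, long recentSyncDurationNanos)
    {
        // Heterogeneous drives and uneven external load naturally push
        // slower volumes towards the back of the queue.
        v.expectedSyncFinishNanos = syncStartedNanos + recentSyncDurationNanos;
        queuedVolumes.add(v);
    }

    Volume nextVolume()
    {
        return queuedVolumes.poll(); // the volume expected to be free soonest
    }
}
{code}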
I'm leaning towards (1) for the flexibility, but that may be a performance
regression in the single-volume case. Is it worth investing the time to try
out two, or even all three, of the options?
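And for completeness, a sketch of the compare-and-swap switch from option (2);
the linked-section structure here is an assumption made for illustration:

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Sketch of option 2: with a fixed round-robin order, section IDs are
// assigned when a volume enters the queue, so switching to the next prepared
// section reduces to a single compare-and-swap.
class CasSectionSwitchSketch
{
    static class Section
    {
        final long id;
        volatile Section next; // linked in queue order; ID assigned on entry
        Section(long id) { this.id = id; }
    }

    private final AtomicReference<Section> current = new AtomicReference<>();

    // Any thread observing a sync request may attempt the switch; exactly
    // one CAS succeeds and the rest simply see the new current section.
    boolean advance(Section expected)
    {
        Section next = expected.next;
        return next != null && current.compareAndSet(expected, next);
    }
}
{code}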
> Add the ability to automatically distribute your commitlogs across all data
> volumes
> -----------------------------------------------------------------------------------
>
> Key: CASSANDRA-7075
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7075
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Tupshin Harper
> Assignee: Branimir Lambov
> Priority: Minor
> Labels: performance
> Fix For: 3.0
>
>
> given the prevalence of ssds (no need to separate commitlog and data), and
> improved jbod support, along with CASSANDRA-3578, it seems like we should
> have an option to have one commitlog per data volume, to even the load. i've
> been seeing more and more cases where there isn't an obvious "extra" volume
> to put the commitlog on, and sticking it on only one of the jbodded ssd
> volumes leads to IO imbalance.