[
https://issues.apache.org/jira/browse/CASSANDRA-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195902#comment-13195902
]
Michaël Figuière commented on CASSANDRA-3578:
---------------------------------------------
In this patch I propose a different approach from Piotr's. In this
implementation there is only one thread to handle syncs; all the processing,
that is serialization, CRC computation and copying the RM into the mmap'd
segment, is handled directly by the writer threads. These threads exchange data
with the syncer thread in a non-blocking way, so the ExecutorService
abstraction has been replaced by a lighter structure.
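As a rough illustration of this writer-thread path (hypothetical names and an
assumed entry layout, not the actual patch code), each writer could CAS-reserve
a slice of the mmap'd segment and then serialize, checksum and copy its
mutation there itself, leaving only the disk sync to the single syncer thread:
{code:java}
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.CRC32;

// Sketch only: illustrative names, assumed entry layout (length + payload + CRC).
class SegmentWriterSketch
{
    private final MappedByteBuffer segment;
    private final AtomicInteger allocatedBytes = new AtomicInteger(0);

    SegmentWriterSketch(MappedByteBuffer segment)
    {
        this.segment = segment;
    }

    /** Called concurrently by writer threads. Returns -1 when the segment is full. */
    int append(byte[] serializedMutation)
    {
        int size = 4 + serializedMutation.length + 8; // length + payload + CRC
        int offset;
        do
        {
            offset = allocatedBytes.get();
            if (offset + size > segment.capacity())
                return -1; // the caller must trigger the segment switch
        }
        while (!allocatedBytes.compareAndSet(offset, offset + size));

        CRC32 crc = new CRC32();
        crc.update(serializedMutation);

        ByteBuffer slice = segment.duplicate(); // private view: no shared position to contend on
        slice.position(offset);
        slice.putInt(serializedMutation.length);
        slice.put(serializedMutation);
        slice.putLong(crc.getValue());
        return offset;
    }
}
{code}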
Several components of the CL were challenging to implement in this manner:
*CL Segment switch*
Switching to a new CL segment when the current one is full isn't
straightforward without locks. Here we use a boolean mark that is atomically
CASed by a writer thread, giving it the responsibility for performing the
switch. If the mark can't be grabbed, the thread waits on a condition that is
later reused, with stamps to avoid any ABA problem.
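A minimal sketch of this election, assuming a simple stamp counter and a single
reusable condition (the names are mine, not the patch's):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: the writer that wins the CAS performs the switch, the others
// wait; the stamp lets a reused condition tell "my" switch from a later one.
class SegmentSwitcher
{
    private final AtomicBoolean switching = new AtomicBoolean(false);
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition switched = lock.newCondition();
    private volatile long stamp = 0; // incremented after every completed switch

    void switchSegment()
    {
        long observed = stamp;
        if (switching.compareAndSet(false, true))
        {
            allocateNewSegment(); // the winner performs the actual switch
            lock.lock();
            try
            {
                stamp++;
                switching.set(false);
                switched.signalAll();
            }
            finally
            {
                lock.unlock();
            }
        }
        else
        {
            lock.lock();
            try
            {
                // Only wait if the switch we observed hasn't completed yet (no ABA).
                while (stamp == observed && switching.get())
                    switched.awaitUninterruptibly();
            }
            finally
            {
                lock.unlock();
            }
        }
    }

    private void allocateNewSegment()
    {
        // placeholder: create/activate the next mmap'd segment
    }
}
{code}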
*Batch CL*
The Batch CL strategy is considered the safer mode for Cassandra as it
guarantees the client that the RM is synced to disk before answering. With a
multithreaded CL, we must ensure that we don't acknowledge an RM that is synced
to disk but preceded by an unsynced RM in the CL segment, as that would make
replaying the RM impossible. For this reason, we track the state of each RM's
processing and, when the sync() call is executed, mark as synced only the
contiguous run of fully written RMs.
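A minimal sketch of that contiguous-run rule, with hypothetical names and an
index-based tracker (the real patch may track state differently):
{code:java}
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch only: an entry can be acknowledged once it and every entry before it
// in the segment are on disk. Intended use by the syncer thread:
//   int end = tracker.writtenPrefixEnd(firstUnsynced); // snapshot BEFORE the sync
//   buffer.force();                                    // sync the mmap'd segment
//   tracker.markSynced(firstUnsynced, end);            // acknowledge that prefix only
class BatchSyncTracker
{
    private static final int PENDING = 0, WRITTEN = 1, SYNCED = 2;

    private final AtomicIntegerArray states;

    BatchSyncTracker(int capacity)
    {
        states = new AtomicIntegerArray(capacity); // all PENDING initially
    }

    /** Writer thread: entry 'index' is fully serialized, checksummed and copied. */
    void markWritten(int index)
    {
        states.set(index, WRITTEN);
    }

    /** First index after the contiguous run of WRITTEN entries starting at 'from'. */
    int writtenPrefixEnd(int from)
    {
        int i = from;
        while (i < states.length() && states.get(i) == WRITTEN)
            i++;
        return i;
    }

    /** Syncer thread, after the sync: entries [from, to) can now be acknowledged. */
    void markSynced(int from, int to)
    {
        for (int i = from; i < to; i++)
            states.set(i, SYNCED);
    }

    boolean isSynced(int index)
    {
        return states.get(index) == SYNCED;
    }
}
{code}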
Since we avoid any blocking queue, we still need a way to put the writer
threads on hold while the sync is being performed. LockSupport.park()/unpark()
provides a nice way to do it without relying on any coarse-grained
synchronization, and it avoids any condition reuse/renewal issue.
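For instance, a writer could park itself until the syncer reports that its
entry is covered by a sync (again a sketch with names of my own, not the
patch's):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.LockSupport;

// Sketch only: park()/unpark() instead of a monitor or blocking queue.
class BatchAckBarrier
{
    // writers register themselves under the index of the entry they wrote
    private final Map<Integer, Thread> waiters = new ConcurrentHashMap<Integer, Thread>();
    private volatile int syncedUpTo = 0; // exclusive bound of synced entries

    /** Writer thread: block until entry 'index' has been covered by a sync. */
    void awaitSynced(int index)
    {
        waiters.put(index, Thread.currentThread());
        // park() may return spuriously, so always re-check the condition
        while (index >= syncedUpTo)
            LockSupport.park(this);
        waiters.remove(index);
    }

    /** Syncer thread: after the sync, wake every writer whose entry is now durable. */
    void signalSynced(int upTo)
    {
        syncedUpTo = upTo;
        for (Map.Entry<Integer, Thread> e : waiters.entrySet())
            if (e.getKey() < upTo)
                LockSupport.unpark(e.getValue());
    }
}
{code}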
*Periodic CL*
The Periodic CL's challenge is mostly around throttling the writers, as here
again we avoid any synchronized queue in order to reduce contention. Actually
we only need "half a blocking queue", as nothing is really added or consumed.
For this reason we just use an atomic counter and an empty/full pair of
conditions. Here again, a pool of conditions and a stamp are used to avoid the
ABA problem.
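The shape of that "half a blocking queue" is roughly the following (a sketch
with a plain park/unpark loop standing in for the pool of stamped conditions):
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Sketch only: writers bump a counter of unsynced mutations and park when the
// threshold is reached; the syncer resets the counter after each sync and
// wakes them. Nothing is ever dequeued: it is "half a blocking queue".
class PeriodicThrottle
{
    private final int maxUnsynced;
    private final AtomicInteger unsynced = new AtomicInteger(0);
    private final Queue<Thread> parked = new ConcurrentLinkedQueue<Thread>();

    PeriodicThrottle(int maxUnsynced)
    {
        this.maxUnsynced = maxUnsynced;
    }

    /** Writer thread: called before writing one mutation into the segment. */
    void acquire()
    {
        while (true)
        {
            int current = unsynced.get();
            if (current < maxUnsynced && unsynced.compareAndSet(current, current + 1))
                return;
            // "full": park until the syncer drains the counter
            parked.add(Thread.currentThread());
            if (unsynced.get() >= maxUnsynced) // re-check to avoid a missed wakeup
                LockSupport.park(this);
            parked.remove(Thread.currentThread());
        }
    }

    /** Syncer thread: called after each periodic sync. */
    void release()
    {
        unsynced.set(0);
        Thread t;
        while ((t = parked.poll()) != null)
            LockSupport.unpark(t);
    }
}
{code}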
*End of Segment marker*
Another point is that this implementation doesn't use any End of Segment
marker. As we now have several concurrent writers, it's no longer possible to
write a temporary marker after an entry. That means the recently committed code
that fixes CASSANDRA-3615 is obviously not included in this patch.
Nevertheless, a mechanism to avoid unwanted replay of entries from a recycled
segment is still required. I haven't included it in the patch as I think it's a
design choice that needs to be debated, but it seems straightforward to
implement. The options I can see are the following:
- Fill the CL segment file with zeros on recycling. Doing so avoids any problem
but will typically require a several-second write on recycling, which will lead
to write latency hiccups.
- Include the segment id in every entry. This avoids any problem as well but
increases the entry size by 8 bytes, which has a cost but isn't dramatic, and
can be seen as spreading the cost of the previous option over the entire CL
writing.
- Salt the two checksums included in each entry with the segment id. Doing so
lowers the probability of any unwanted replay to a level that seems fairly
acceptable. The advantage of this solution is that its performance cost is nil
(sketched below).
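For that last option, the idea is simply to mix the segment id into the CRC so
an entry left over from a previous use of a recycled segment fails verification
at replay time. A minimal sketch, assuming a CRC32 checksum and an 8-byte
segment id (the actual CL entry format may differ):
{code:java}
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch only: salt the checksum with the id of the segment being written,
// and verify at replay time with the id of the segment being read.
final class SaltedChecksum
{
    static long checksum(long segmentId, byte[] entry)
    {
        CRC32 crc = new CRC32();
        // salt with the 8 bytes of the segment id first...
        crc.update(ByteBuffer.allocate(8).putLong(segmentId).array());
        // ...then checksum the entry bytes as usual
        crc.update(entry, 0, entry.length);
        return crc.getValue();
    }

    /** Replay side: an entry written under another segment id will not match. */
    static boolean verify(long currentSegmentId, byte[] entry, long storedCrc)
    {
        return checksum(currentSegmentId, entry) == storedCrc;
    }
}
{code}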
Finally, here are some noteworthy observations:
* Here the writer thread WAITS for the processing to complete. Compared to a
_push-on-queue-and-forget_ approach, this slightly increases write latency when
using the Periodic CL (the Batch CL still being synchronous), especially for
large RMs. Nevertheless, on a highly loaded server, the next writes waiting to
be executed would have to wait for their thread to be scheduled anyway, so the
latency cost might end up being paid either way. Increasing the number of
writer threads should help make small RMs less sensitive to large RMs.
* If extensive benchmarks show that the previous point is an issue, there's
some room to make the Periodic CL asynchronous with respect to the writer
threads.
* To reduce as much as possible the contention on the atomic state that can be
modified several times by each thread, some naughty packing of several states
within a single AtomicLong is used, as it decreases the likelihood of an extra
spin compared to a more classical AtomicReference approach to non-blocking
synchronization (a small illustration follows this list). The downside is code
complexity, so I think AtomicReference remains an option to make the code more
readable and maintainable.
* For now, to ensure the required throttling of incoming RMs, we use a constant
function with a fixed threshold of unsynced mutations. But we now have the
tools to easily make the function more complex, for instance making it
non-constant and taking the size of the mutations into account (a sketch of
such a function also follows this list).
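To illustrate the AtomicLong packing mentioned above (the field widths and
names here are made up, not the patch's), two pieces of state are folded into
one long so both can be read and updated with a single CAS:
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: lower 40 bits = write position in the segment,
// upper 24 bits = number of in-flight entries.
// Overflow and segment-full handling are omitted.
class PackedSegmentState
{
    private static final int POSITION_BITS = 40;
    private static final long POSITION_MASK = (1L << POSITION_BITS) - 1;

    private final AtomicLong state = new AtomicLong(0);

    /** Reserve 'size' bytes and bump the in-flight count in one CAS. */
    long reserve(int size)
    {
        while (true)
        {
            long current = state.get();
            long position = current & POSITION_MASK;
            long inFlight = current >>> POSITION_BITS;
            long next = ((inFlight + 1) << POSITION_BITS) | (position + size);
            if (state.compareAndSet(current, next))
                return position; // offset where this entry may be written
        }
    }

    int inFlight()
    {
        return (int) (state.get() >>> POSITION_BITS);
    }

    long position()
    {
        return state.get() & POSITION_MASK;
    }
}
{code}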
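And the throttling function from the last point could evolve along these lines
(purely hypothetical interface and thresholds):
{code:java}
// Sketch only: today the throttle is a constant threshold on the number of
// unsynced mutations; the same hook could also look at the unsynced bytes.
interface ThrottleFunction
{
    /** @return true when the writer must wait for the next sync */
    boolean shouldThrottle(int unsyncedMutations, long unsyncedBytes);
}

final class ConstantThreshold implements ThrottleFunction
{
    public boolean shouldThrottle(int unsyncedMutations, long unsyncedBytes)
    {
        return unsyncedMutations >= 1024; // fixed, size-agnostic threshold
    }
}

final class SizeAwareThreshold implements ThrottleFunction
{
    public boolean shouldThrottle(int unsyncedMutations, long unsyncedBytes)
    {
        // also bound the amount of unsynced data, e.g. 32 MB
        return unsyncedMutations >= 1024 || unsyncedBytes >= 32L * 1024 * 1024;
    }
}
{code}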
> Multithreaded commitlog
> -----------------------
>
> Key: CASSANDRA-3578
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3578
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Priority: Minor
> Attachments: parallel_commit_log_2.patch
>
>
> Brian Aker pointed out a while ago that allowing multiple threads to modify
> the commitlog simultaneously (reserving space for each with a CAS first, the
> way we do in the SlabAllocator.Region.allocate) can improve performance,
> since you're not bottlenecking on a single thread to do all the copying and
> CRC computation.
> Now that we use mmap'd CommitLog segments (CASSANDRA-3411) this becomes
> doable.
> (moved from CASSANDRA-622, which was getting a bit muddled.)