[
https://issues.apache.org/jira/browse/CASSANDRA-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662265#comment-16662265
]
Jason Brown edited comment on CASSANDRA-14503 at 10/24/18 1:09 PM:
-------------------------------------------------------------------
Based on testing conducted with [~jolynch] and [~vinaykumarcse], here's an
updated branch with performance fixes and code improvements:
||v2||
|[branch|https://github.com/jasobrown/cassandra/tree/14503-v2]|
|[utests &
dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14503-v2]|
|[pull request|https://github.com/apache/cassandra/pull/289]|
The major change in this branch is that I experimented with aggregating the
messages to send into a single ByteBuf, instead of sending each message
individually into the netty pipeline. Since we can send up to (the currently
hard-coded limit of) 64 messages per iteration of OMC.dequeueMessages(), that's
64 invocations of the pipeline mechanics, 64 ByteBuf allocations (and releases),
and 64 promise fulfillments, one per message. If, instead, we send one ByteBuf
(with all the data serialized into it), it's just one message into the pipeline,
one allocation, and one promise fulfillment. The primary trade-off is that the
single buffer will, of course, be large; perhaps large enough to not be
efficient with the netty allocator. To that end, I wrote a JMH benchmark, and
the results are compelling: TL;DR a single buffer is significantly faster than
multiple smaller buffers. In the closest case the single buffer is twice as
fast, with the typical percentile difference being about 10-20 times faster for
the single buffer (1.5 micros vs. 23 micros).
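To make the idea concrete, here's a minimal sketch of the aggregation approach
(Message, serializedSize() and serializeTo() are illustrative placeholders, not
the actual OMC/serializer API): drain up to the batch limit from the backlog,
size and fill one ByteBuf, and issue a single write with a single promise.
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;

import java.util.ArrayDeque;
import java.util.Queue;

final class AggregatingWriter
{
    private static final int MAX_MESSAGES_PER_DEQUEUE = 64; // mirrors the hard-coded batch size

    // placeholder for the real message/serializer types
    interface Message
    {
        int serializedSize();
        void serializeTo(ByteBuf out);
    }

    void dequeueAndWrite(Queue<Message> backlog, Channel channel)
    {
        // first pass: drain a batch and compute the aggregate buffer size
        Queue<Message> batch = new ArrayDeque<>(MAX_MESSAGES_PER_DEQUEUE);
        int totalBytes = 0;
        Message msg;
        while (batch.size() < MAX_MESSAGES_PER_DEQUEUE && (msg = backlog.poll()) != null)
        {
            totalBytes += msg.serializedSize();
            batch.add(msg);
        }

        if (batch.isEmpty())
            return;

        // one allocation for the whole batch, instead of one per message
        ByteBufAllocator allocator = channel.alloc();
        ByteBuf aggregate = allocator.directBuffer(totalBytes);
        for (Message m : batch)
            m.serializeTo(aggregate);

        // one write into the pipeline and one promise to fulfill for the whole batch
        ChannelPromise promise = channel.newPromise();
        channel.writeAndFlush(aggregate, promise);
    }
}
{code}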
To make this work, I needed the allocation and serialization code to be moved
outside of the pipeline handler (as it now needs to be invoked from OMC). I had
already done this work in CASSANDRA-13630, so I pulled that patch into this
branch. That patch also greatly reduced the need for the ChannelWriter
abstraction, and combined with the outstanding work in this branch, I am able
to eliminate ChannelWriter and the confusion it added. However, I still need to
handle large messages separately (as we don't want to use our blocking
serializers on the event loop), so I've preserved the "move large message
serialization to a separate thread" behavior from CASSANDRA-13630 by adding a
new abstraction to OMC: the (not cleverly named) MessageDequeuer interface,
with implementations for large messages and "small messages" (basically the
current behavior of this patch that we've been riffing on).
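For illustration, a hedged sketch of what the MessageDequeuer split looks like
(class and method names here are placeholders; the actual interface in the
branch may differ): the small-message dequeuer does its work on the event loop,
while the large-message dequeuer pushes blocking serialization onto a dedicated
thread.
{code:java}
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface MessageDequeuer
{
    /** Drain queued messages and push them toward the channel. */
    void dequeueMessages();

    void close();
}

final class SmallMessageDequeuer implements MessageDequeuer
{
    private final Queue<Object> backlog;
    private final Runnable writeBatchToChannel; // e.g. aggregate into one ByteBuf and write

    SmallMessageDequeuer(Queue<Object> backlog, Runnable writeBatchToChannel)
    {
        this.backlog = backlog;
        this.writeBatchToChannel = writeBatchToChannel;
    }

    public void dequeueMessages()
    {
        // invoked on the event loop: serialize/aggregate inline, then write once
        if (!backlog.isEmpty())
            writeBatchToChannel.run();
    }

    public void close()
    {
        backlog.clear();
    }
}

final class LargeMessageDequeuer implements MessageDequeuer
{
    private final ExecutorService serializationThread = Executors.newSingleThreadExecutor();
    private final Runnable serializeAndWrite; // blocking serialization, kept off the event loop

    LargeMessageDequeuer(Runnable serializeAndWrite)
    {
        this.serializeAndWrite = serializeAndWrite;
    }

    public void dequeueMessages()
    {
        // blocking serializers run on the dedicated thread, never the event loop
        serializationThread.submit(serializeAndWrite);
    }

    public void close()
    {
        serializationThread.shutdown();
    }
}
{code}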
One feature that we've been debating again is whether to include message
coalescing. The current branch does not include it, mostly because we've been
iterating quite quickly over this code and I broke it when incorporating the
CASSANDRA-13630 patch (and killing off ChannelWriter). There is some testing
happening to reevaluate the efficacy of message coalescing with the netty
internode messaging.
Some other points of interest:
- switched OMC#backlog from ConcurrentLinkedQueue to jctools' MpscLinkedQueue.
MpscLinkedQueue is dramatically better, and ConcurrentLinkedQueue#isEmpty was a
CPU drain (see the sketch after this list).
- improved scheduling of the consumerTask in OutboundMessagingConnection,
though it still needs a bit more refinement
- ditched the OMC.State from the last branch
- added [~jolynch]'s fixes wrt not setting a default SO_SNDBUF value
- OMC - introduced consumerTaskThread vs eventLoop member field
- ditched the auto-read in RebufferingByteBufDataInputPlus - I need to document
this
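As referenced above, a minimal sketch of the backlog swap (QueuedMessage is a
stand-in for the real message type, and the exact jctools construction API
depends on the library version): the backlog has many producers but exactly one
consumer, which is the access pattern MpscLinkedQueue is designed for.
{code:java}
import java.util.Queue;

import org.jctools.queues.MpscLinkedQueue;

final class Backlog
{
    static final class QueuedMessage { /* payload omitted */ }

    // before: new ConcurrentLinkedQueue<>(). Any thread may offer, but only the
    // consumer task polls, so the multi-producer/single-consumer queue fits.
    private final Queue<QueuedMessage> backlog = new MpscLinkedQueue<>();

    void offer(QueuedMessage m)
    {
        backlog.offer(m); // safe from any producer thread
    }

    QueuedMessage pollOnConsumerThread()
    {
        return backlog.poll(); // must only be called from the single consumer
    }

    boolean hasPending()
    {
        return !backlog.isEmpty(); // far cheaper than ConcurrentLinkedQueue#isEmpty
    }
}
{code}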
In general, I have a small bit of documentation to add, but the branch is ready
for review.
> Internode connection management is race-prone
> ---------------------------------------------
>
> Key: CASSANDRA-14503
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14503
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: Jason Brown
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Following CASSANDRA-8457, internode connection management has been rewritten
> to rely on Netty, but the new implementation in
> {{OutboundMessagingConnection}} seems quite race prone to me, in particular
> on those two cases:
> * {{#finishHandshake()}} racing with {{#close()}}: i.e. in such case the
> former could run into an NPE if the latter nulls the {{channelWriter}} (but
> this is just an example, other conflicts might happen).
> * Connection timeout and retry racing with state changing methods:
> {{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when
> handshaking or closing, but there's no guarantee those will be actually
> cancelled (as they might be already running), so they might end up changing
> the connection state concurrently with other methods (i.e. by unexpectedly
> closing the channel or clearing the backlog).
> Overall, the thread safety of {{OutboundMessagingConnection}} is very
> difficult to assess given the current implementation: I would suggest to
> refactor it into a single-thread model, where all connection state changing
> actions are enqueued on a single threaded scheduler, so that state
> transitions can be clearly defined and checked.