[
https://issues.apache.org/jira/browse/CASSANDRA-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662265#comment-16662265
]
Jason Brown edited comment on CASSANDRA-14503 at 10/24/18 1:09 PM:
-------------------------------------------------------------------
Based on testing conducted with [~jolynch] and [~vinaykumarcse], here's an
updated branch with performance fixes and code improvements:
||v2||
|[branch|https://github.com/jasobrown/cassandra/tree/14503-v2]|
|[utests &
dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14503-v2]|
|[pull request|https://github.com/apache/cassandra/pull/289]|
The major change in this branch is that I experimented with aggregating the
messages to send into a single ByteBuf, instead of sending each message
individually into the netty pipeline. Since we can send up to (the currently
hard-coded limit of) 64 messages per iteration of OMC.dequeueMessages(), that's
64 invocations of the pipeline mechanics, 64 ByteBuf allocations (and releases),
and 64 promise fulfillments, one per message. If, instead, we send one ByteBuf
(with all the data serialized into it), it's just one message into the pipeline,
one allocation, and one promise fulfillment. The primary trade-off is that the
single buffer will, of course, be large; perhaps large enough to not be
efficient with the netty allocator. To that end, I wrote a JMH benchmark, and
the results are compelling: TL;DR a single buffer is significantly faster than
multiple smaller buffers. In the closest case the single buffer is twice as
fast, with the typical percentile difference being about 10-20 times faster for
the single buffer (1.5 micros vs. 23 micros).
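To make the idea concrete, here's a minimal sketch of the aggregation approach
(Message, serializedSize() and serializeTo() are illustrative placeholders, not
the actual OMC/serializer API): drain up to the batch limit from the backlog,
size and fill one ByteBuf, and issue a single write with a single promise.
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;

import java.util.ArrayDeque;
import java.util.Queue;

final class AggregatingWriter
{
    private static final int MAX_MESSAGES_PER_DEQUEUE = 64; // mirrors the hard-coded batch size

    // placeholder for the real message/serializer types
    interface Message
    {
        int serializedSize();
        void serializeTo(ByteBuf out);
    }

    void dequeueAndWrite(Queue<Message> backlog, Channel channel)
    {
        // first pass: drain a batch and compute the aggregate buffer size
        Queue<Message> batch = new ArrayDeque<>(MAX_MESSAGES_PER_DEQUEUE);
        int totalBytes = 0;
        Message msg;
        while (batch.size() < MAX_MESSAGES_PER_DEQUEUE && (msg = backlog.poll()) != null)
        {
            totalBytes += msg.serializedSize();
            batch.add(msg);
        }

        if (batch.isEmpty())
            return;

        // one allocation for the whole batch, instead of one per message
        ByteBufAllocator allocator = channel.alloc();
        ByteBuf aggregate = allocator.directBuffer(totalBytes);
        for (Message m : batch)
            m.serializeTo(aggregate);

        // one write into the pipeline and one promise to fulfill for the whole batch
        ChannelPromise promise = channel.newPromise();
        channel.writeAndFlush(aggregate, promise);
    }
}
{code}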
To make this work, I needed the allocation and serialization code to be moved
outside of the pipeline handler (as it now needs to be invoked from OMC). I had
already done this work in CASSANDRA-13630, so I pulled that patch into this
branch. That patch also greatly reduced the need for the ChannelWriter
abstraction, and combined with the outstanding work in this branch, I am able
to eliminate ChannelWriter and the confusion it added. However, I still need to
handle large messages separately (as we don't want to use our blocking
serializers on the event loop), so I've preserved the "move large message
serialization to a separate thread" behavior from CASSANDRA-13630 by adding a
new abstraction to OMC: the (not cleverly named) MessageDequeuer interface,
with implementations for large messages and "small messages" (basically the
current behavior of this patch that we've been riffing on).
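For illustration, a hedged sketch of what the MessageDequeuer split looks like
(class and method names here are placeholders; the actual interface in the
branch may differ): the small-message dequeuer does its work on the event loop,
while the large-message dequeuer pushes blocking serialization onto a dedicated
thread.
{code:java}
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface MessageDequeuer
{
    /** Drain queued messages and push them toward the channel. */
    void dequeueMessages();

    void close();
}

final class SmallMessageDequeuer implements MessageDequeuer
{
    private final Queue<Object> backlog;
    private final Runnable writeBatchToChannel; // e.g. aggregate into one ByteBuf and write

    SmallMessageDequeuer(Queue<Object> backlog, Runnable writeBatchToChannel)
    {
        this.backlog = backlog;
        this.writeBatchToChannel = writeBatchToChannel;
    }

    public void dequeueMessages()
    {
        // invoked on the event loop: serialize/aggregate inline, then write once
        if (!backlog.isEmpty())
            writeBatchToChannel.run();
    }

    public void close()
    {
        backlog.clear();
    }
}

final class LargeMessageDequeuer implements MessageDequeuer
{
    private final ExecutorService serializationThread = Executors.newSingleThreadExecutor();
    private final Runnable serializeAndWrite; // blocking serialization, kept off the event loop

    LargeMessageDequeuer(Runnable serializeAndWrite)
    {
        this.serializeAndWrite = serializeAndWrite;
    }

    public void dequeueMessages()
    {
        // blocking serializers run on the dedicated thread, never the event loop
        serializationThread.submit(serializeAndWrite);
    }

    public void close()
    {
        serializationThread.shutdown();
    }
}
{code}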
One feature that we've been debating again is whether to include message
coalescing. The current branch does not include it, mostly because we've been
iterating quite quickly over this code and I broke it when incorporating the
CASSANDRA-13630 patch (and killing off ChannelWriter). There is some testing
happening to reevaluate the efficacy of message coalescing with the netty
internode messaging.
Some other points of interest:
- switched OMC#backlog from ConcurrentLinkedQueue to jctools' MpscLinkedQueue.
MpscLinkedQueue is dramatically better, and ConcurrentLinkedQueue#isEmpty was a
CPU drain (see the sketch after this list).
- improved scheduling of the consumerTask in OutboundMessagingConnection,
though it still needs a bit more refinement
- ditched the OMC.State from the last branch
- added [~jolynch]'s fixes wrt not setting a default SO_SNDBUF value
- OMC - introduced consumerTaskThread vs eventLoop member field
- ditched the auto-read in RebufferingByteBufDataInputPlus - I need to document
this
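As referenced above, a minimal sketch of the backlog swap (QueuedMessage is a
stand-in for the real message type, and the exact jctools construction API
depends on the library version): the backlog has many producers but exactly one
consumer, which is the access pattern MpscLinkedQueue is designed for.
{code:java}
import java.util.Queue;

import org.jctools.queues.MpscLinkedQueue;

final class Backlog
{
    static final class QueuedMessage { /* payload omitted */ }

    // before: new ConcurrentLinkedQueue<>(). Any thread may offer, but only the
    // consumer task polls, so the multi-producer/single-consumer queue fits.
    private final Queue<QueuedMessage> backlog = new MpscLinkedQueue<>();

    void offer(QueuedMessage m)
    {
        backlog.offer(m); // safe from any producer thread
    }

    QueuedMessage pollOnConsumerThread()
    {
        return backlog.poll(); // must only be called from the single consumer
    }

    boolean hasPending()
    {
        return !backlog.isEmpty(); // far cheaper than ConcurrentLinkedQueue#isEmpty
    }
}
{code}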
In general, I have a small bit of documentation to add, but the branch is ready
for review.
> Internode connection management is race-prone
> ---------------------------------------------
>
> Key: CASSANDRA-14503
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14503
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: Jason Brown
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Following CASSANDRA-8457, internode connection management has been rewritten
> to rely on Netty, but the new implementation in
> {{OutboundMessagingConnection}} seems quite race prone to me, in particular
> on those two cases:
> * {{#finishHandshake()}} racing with {{#close()}}: i.e. in such case the
> former could run into an NPE if the latter nulls the {{channelWriter}} (but
> this is just an example, other conflicts might happen).
> * Connection timeout and retry racing with state changing methods:
> {{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when
> handshaking or closing, but there's no guarantee those will be actually
> cancelled (as they might be already running), so they might end up changing
> the connection state concurrently with other methods (i.e. by unexpectedly
> closing the channel or clearing the backlog).
> Overall, the thread safety of {{OutboundMessagingConnection}} is very
> difficult to assess given the current implementation: I would suggest to
> refactor it into a single-thread model, where all connection state changing
> actions are enqueued on a single threaded scheduler, so that state
> transitions can be clearly defined and checked.