[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943160#comment-15943160
 ] 

Jason Brown commented on CASSANDRA-8457:
----------------------------------------

- I really like what you've done with {{HandshakeProtocol}}. It's a nice 
clarification/centralization of the handshake data.
- I'm cool with ditching the {{DisabledCoalescingStrategy}}, but didn't like 
passing {{null}} around. Thus, I switched to passing 
{{Optional<CoalesingStrategy>}} as the intent is a bit more explicit.
- With the refactorings, it made it easier to remove {{OutboundConnection}} as 
it's reduced functionality can now live simply in 
{{OutboundMessagingConnection}}, without complicating it.
- addressed the comments about timeouts on the outbound handshake side, and 
reduced them to a single timeout (the one that was already in 
{{OutboundMessagingConnection}})
- upgraded to netty 4.1.9
- updated a lot of comments and documentation.
- added in a lot more tests. code coverage is now > 90% for the 
{{org.apache.cassandra.net.async}} package.

One remaining open issue is the current patch does not expire messages on the 
producer thread, like what we used to do in {{OutboundTcpConnection#enqueue}}. 
I'm awaiting the outcome of CASSANDRA-13324, and will apply the result here.    
   

bq. In {{connectionTimeout()}}, what happens when a connection attempt timeout?

This is what 
[{{OutboundTcpConnnection}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L248]
 does, and I've replicated it here. Note that there could be ssveral connection 
attempts before the connection timeout triggers.

bq. in the DISCONNECT/NEGOTIATION_FAILURE case, we don't seem to actively 
trying to reconect

The {{DISCONNECT}} case handles the cases when the protocol versions do not 
match. In {{OutboundTcpConnnection}}, if the [peer's version is lesser 
than|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L465]
 what we expected it to be, we clear the backlog and do not attempt to 
reconnect until more messages are {{#enqueue()}}'d. On the contrary, if the 
[peer's version is greater 
than|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L488]
 what we expected it to be, we'll finish the handshake, send what's currently 
in the backlog, and then disconnect. 

TBH, I think both of these behaviors are incorrect. If the peer's version is 
lesser than what we expect, we know how to send messages (both handshake and 
message serialization), so we could send what we have in the backlog without 
throwing them away. If the peer's version is greater than what we expect, if we 
try to send messages it might fail on the receiver side due to a change in the 
handshake sequence (unlikely) or messaging framing/format (likely). WDYT? 
Either way, I'll need to update {{OutboundMessageConnection}} as it's not quite 
correct rn.

{{NEGOTIATION_FAILURE}} does what 
[{{OutboundTcpConnnection}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L453]:
 if we can't get a version from the peer, either we couldn't connect or the 
handshake failed. I think the current behavior (throw away the backlog) is 
desirable.

bq. In {{InboundHandshakeHandler.handleMessagingStartResponse()}}, maybe we 
should check the return of handshakeTimeout.cancel()

As the {{handshakeTimeout}} executes in the netty event loop, it won't complete 
with the {{#decode}} method. As {{#failHandshake()}} closes the channel, we 
won't even get the {{ThirdHandshakeMessage}} into 
{{handleMessagingStartResponse}} as {{#decode()}} would never be triggered.

bq. Regarding the sending and receiving buffer size handling, ....

I agree with all your points and have implemented them.

> nio MessagingService
> --------------------
>
>                 Key: CASSANDRA-8457
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: netty, performance
>             Fix For: 4.x
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to