[ 
https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368255#comment-16368255
 ] 

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Addressed [~aweisberg]'s first round of feedback. Now initializing all 
connection types at startup. Also, I've modified {{MessageOut}} to allow 
senders to declare the connection type they want to use. Related to this, I 
corrected the behavior of the gossip {{EchoMessage}} on the peer's side by 
sending the response out on the gossip channel (it currently responds on the 
small message channel, because it's sending a {{REQUEST_RESPONSE}}). However, 
it's still necessary to distinguish between {{EchoMessage}} and 
{{PingMessage}}, as {{PingMessage}} includes an extra byte to express the 
connection type the peer should use. Deserialization of {{EchoMessage}} on a 
node that doesn't know to read the extra byte at the end will cause problems 
on that connection when deserializing the next message, because of the extra 
byte it wasn't expecting.
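
To illustrate the extra-byte problem described above, here is a minimal, self-contained sketch. It does not use Cassandra's real serializers: the wire layout (a one-byte connection type followed by a four-byte "next message") and every name in it are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Cassandra's actual serialization code.
class TrailingByteDemo
{
    static final int NEXT_MESSAGE = 42; // assumed content of the message that follows

    // Sender writes a ping-style body (one connection-type byte), then the next message.
    static ByteBuffer writeStream(byte connectionType)
    {
        ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES);
        buf.put(connectionType); // the extra byte an echo-only reader does not expect
        buf.putInt(NEXT_MESSAGE);
        buf.flip();
        return buf;
    }

    // A reader that knows about the extra byte consumes it before the next message.
    static int readAsPing(ByteBuffer wire)
    {
        ByteBuffer in = wire.duplicate();
        in.get();
        return in.getInt();
    }

    // A reader that treats the body as empty starts the next message one byte early.
    static int readAsEcho(ByteBuffer wire)
    {
        ByteBuffer in = wire.duplicate();
        return in.getInt();
    }

    public static void main(String[] args)
    {
        ByteBuffer wire = writeStream((byte) 2);
        System.out.println("ping-aware reader sees next message: " + readAsPing(wire));
        System.out.println("echo-only reader sees next message:  " + readAsEcho(wire));
    }
}
```

In this sketch the echo-only reader decodes a value other than the one that was sent, because the unread byte is treated as the start of the following message.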

Also, I don't need to make {{PongMessage}} a verb as it won't need a custom 
{{VerbHandler}}; it can just use {{ResponseVerbHandler}}, which is assigned to 
{{REQUEST_RESPONSE}} messages.

The main logic of this patch was originally in {{MessagingService}}, but I've 
moved it into its own class ({{StartupClusterConnectivityChecker}}) and 
slightly refactored it to make unit testing easier. I also added a unit test.
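
As a rough sketch of the kind of logic such a checker could encapsulate (block until a configured percentage of peers is alive and connected, or until a timeout expires), consider the following. This is not the code from the patch: the class name {{PeerReadinessGate}}, its methods, and the assumption that each peer reports readiness at most once are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the actual StartupClusterConnectivityChecker.
final class PeerReadinessGate
{
    private final CountDownLatch remaining;

    PeerReadinessGate(int peerCount, int targetPercent)
    {
        // Number of peers that must report ready, rounded up with integer arithmetic.
        int required = (peerCount * targetPercent + 99) / 100;
        remaining = new CountDownLatch(required);
    }

    // Assumed to be called once per peer that is marked alive and fully connected.
    void peerReady()
    {
        remaining.countDown();
    }

    // Blocks until enough peers are ready or the timeout elapses.
    // Returns true if the threshold was met, false on timeout or interruption.
    boolean await(long timeout, TimeUnit unit)
    {
        try
        {
            return remaining.await(timeout, unit);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    public static void main(String[] args)
    {
        PeerReadinessGate gate = new PeerReadinessGate(10, 70);
        for (int i = 0; i < 7; i++)
            gate.peerReady();
        System.out.println("threshold met: " + gate.await(1, TimeUnit.SECONDS));
    }
}
```

Keeping the waiting logic behind a small class like this is one way to make it testable without starting the full messaging stack: a test can call {{peerReady()}} directly instead of waiting for real gossip or connection events.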

Cleaned up the comments on {{MessagingService.Verb}} to be more correct and 
clearer with respect to intent and use. Added a sanity check in the static 
block within {{MessagingService.Verb}} where we build up the {{#idToVerbMap}}. 
We should never allow two verbs to have the same id.
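
A minimal sketch of such a duplicate-id check is shown below. It is not the code from the patch: the enum {{SketchVerb}}, its constants and ids, and the helper {{IdMaps.buildIdMap}} are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical helper; fails fast if two items map to the same id.
final class IdMaps
{
    static <T> Map<Integer, T> buildIdMap(T[] items, ToIntFunction<T> idOf)
    {
        Map<Integer, T> byId = new HashMap<>();
        for (T item : items)
        {
            int id = idOf.applyAsInt(item);
            T previous = byId.put(id, item);
            if (previous != null)
                throw new IllegalArgumentException("duplicate id " + id + ": " + previous + " and " + item);
        }
        return byId;
    }

    public static void main(String[] args)
    {
        System.out.println("id 1 -> " + SketchVerb.fromId(1));
    }
}

// Hypothetical enum; the constants and ids are illustrative, not Cassandra's verb table.
enum SketchVerb
{
    ALPHA(0), BETA(1), GAMMA(2);

    final int id;
    private static final Map<Integer, SketchVerb> idToVerbMap;

    static
    {
        // Runs once at class initialization, so a duplicate id is caught at startup.
        idToVerbMap = IdMaps.buildIdMap(values(), v -> v.id);
    }

    SketchVerb(int id)
    {
        this.id = id;
    }

    static SketchVerb fromId(int id)
    {
        return idToVerbMap.get(id);
    }
}
```

Because the check runs in a static initializer, an id collision in this sketch surfaces as soon as the enum is first loaded rather than as a misrouted message later.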

 

> Add optional startup delay to wait until peers are ready
> --------------------------------------------------------
>
>                 Key: CASSANDRA-13993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>             Fix For: 4.x
>
>
> When bouncing a node in a large cluster, it can take a while to recognize the 
> rest of the cluster as available. This is especially true if using TLS on 
> internode messaging connections. The bouncing node (and any clients connected 
> to it) may see a series of Unavailable or Timeout exceptions until the node 
> is 'warmed up' as connecting to the rest of the cluster is asynchronous from 
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate 
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects 
> the unavailable exceptions
> - having both outbound and inbound connections open and ready to each 
> peer. This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay 
> opening the client native protocol port until some percentage of the peers in 
> the cluster is marked alive and connected to/from. Thus while we potentially 
> slow down startup (delay opening the client port), we reduce the chance 
> that queries made by clients hit transient unavailable/timeout 
> exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

