[
https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428460#comment-16428460
]
Aleksey Yeschenko commented on CASSANDRA-13993:
-----------------------------------------------
I agree with basically everything you said here - except what we should
backport, so:
bq. if we do get an unknown verb id, skip the payload bytes in MessageIn. This
leaves the input stream clean to process future messages.
Yes, please, for 4.0+.
bq. Further, I think we can eliminate the whole UNUSED_ verbs thing as that was
an incomplete defense against unknown verbs, and it didn't account for message
payload.
Yes please. Keep the five we have - or four, rather, because one will be
consumed by {{PING}} (I'd still say let it take {{UNUSED_4}} or {{UNUSED_5}}) -
but don't introduce any more in 4.0 or after 4.0. We will reclaim the existing
ones eventually as we EOL older releases.
bq. backport part of CASSANDRA-13283 to get the Verb from a map, not an index
array offset. This gives us safety for future-proofing against unknown verbs.
Not a bad idea, but we should probably be a bit more conservative re: what we
backport to 3.0, and especially 2.2 at this point. How about, instead, we just
backport {{PING}} to 3.11 and 3.0, so in the upgrade scenario there will be no
harm to connections?
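To illustrate the difference between the two lookup styles discussed above, here is a minimal sketch. It is not the actual {{MessagingService}} code: the {{VerbLookup}} class, the trimmed-down {{Verb}} enum and its ids are all hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class VerbLookup {
    // Hypothetical, trimmed-down verb set; the ids are illustrative only.
    public enum Verb {
        MUTATION(0), READ(1), PING(2);

        final int id;

        Verb(int id) { this.id = id; }
    }

    private static final Verb[] BY_INDEX = Verb.values();
    private static final Map<Integer, Verb> BY_ID = new HashMap<>();

    static {
        for (Verb v : Verb.values())
            BY_ID.put(v.id, v);
    }

    // Array-offset lookup: an id this node does not know about
    // throws ArrayIndexOutOfBoundsException.
    public static Verb fromArray(int id) {
        return BY_INDEX[id];
    }

    // Map lookup: an unknown id simply yields null, which the caller
    // can check for and handle without an exception.
    public static Verb fromMap(int id) {
        return BY_ID.get(id);
    }
}
```

With the map, a verb id sent by a newer node comes back as {{null}} rather than blowing up the lookup, so the receiving side gets a chance to decide what to do with the message.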
So, TL;DR, maybe do this?
1. Make 4.0 robust against {{null}} verb and skip remainders of messages we
can't parse. There is precedent for it as well, see
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hints/HintMessage.java#L126-L128
2. Stop introducing new {{UNUSED_}} verbs starting with 4.0.
3. Backport {{PING}} to 3.0 and 3.11, so upgraders from recent 3.0 and 3.11
with the fix will have a smoother experience when going to 4.0.
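A rough sketch of what point 1 could look like, under loudly stated assumptions: the wire format here (a 4-byte verb id, a 4-byte payload size, then the payload) is made up for the example and is not Cassandra's real message framing, and {{SkipUnknownVerb}}, {{verbName}}, {{readAll}} and {{demo}} are hypothetical names, not {{MessageIn}} APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SkipUnknownVerb {
    public static final int PING_ID = 1; // illustrative id, not the real one

    // Hypothetical lookup: returns null for any verb id this node does not know.
    static String verbName(int id) {
        return id == PING_ID ? "PING" : null;
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    static void skipFully(DataInputStream in, int n) throws IOException {
        int remaining = n;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0)
                throw new EOFException("stream ended inside a payload");
            remaining -= skipped;
        }
    }

    // Reads framed messages until the (in-memory) stream is exhausted.
    // available() is good enough for this demo; a real socket would block instead.
    public static List<String> readAll(DataInputStream in) throws IOException {
        List<String> handled = new ArrayList<>();
        while (in.available() > 0) {
            int verbId = in.readInt();
            int payloadSize = in.readInt();
            String verb = verbName(verbId);
            if (verb == null) {
                // Unknown verb: drop its payload so the next readInt()
                // starts at the next message boundary.
                skipFully(in, payloadSize);
                continue;
            }
            byte[] payload = new byte[payloadSize];
            in.readFully(payload);
            handled.add(verb + ":" + payloadSize);
        }
        return handled;
    }

    // Writes an unknown verb (id 99) followed by a PING, then reads both back.
    public static List<String> demo() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(99);
            out.writeInt(3);
            out.write(new byte[] { 1, 2, 3 });
            out.writeInt(PING_ID);
            out.writeInt(2);
            out.write(new byte[] { 4, 5 });
            return readAll(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In {{demo()}} the unknown message is dropped and the {{PING}} behind it is still read correctly. If the payload were not skipped, its bytes would be misread as the next message's header.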
> Add optional startup delay to wait until peers are ready
> --------------------------------------------------------
>
> Key: CASSANDRA-13993
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
> Project: Cassandra
> Issue Type: Improvement
> Components: Lifecycle
> Reporter: Jason Brown
> Assignee: Jason Brown
> Priority: Minor
> Fix For: 4.0
>
>
> When bouncing a node in a large cluster, it can take a while to recognize the
> rest of the cluster as available. This is especially true if using TLS on
> internode messaging connections. The bouncing node (and any clients connected
> to it) may see a series of Unavailable or Timeout exceptions until the node
> is 'warmed up' as connecting to the rest of the cluster is asynchronous from
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects
> the unavailable exceptions.
> - having both outbound and inbound connections open and ready to each
> peer. This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay
> opening the client native protocol port until some percentage of the peers in
> the cluster is marked alive and connected to/from. Thus, while we potentially
> slow down startup (by delaying the opening of the client port), we reduce the
> chance that queries made by clients hit transient unavailable/timeout
> exceptions.