Aleksey Yeschenko commented on CASSANDRA-13993:

I agree with basically everything you said here - except what we should 
backport, so:

bq. if we do get an unknown verb id, skip the payload bytes in MessageIn. This 
leaves the input stream clean to process future messages.

Yes, please, for 4.0+.

bq. Further, I think we can eliminate the whole UNUSED_ verbs thing as that was 
an incomplete defense against unknown verbs, and it didn't account for message 

Yes please. Keep the five we have - or, four, rather, because one will be 
consumed by {{PING}} - and I'd still say let it be {{UNUSED_4}} or 5, but don't 
introduce any more in 4.0, or after 4.0. We will reclaim the existing ones 
eventually as we EOL older releases.

bq. backport part of CASSANDRA-13283 to get the Verb from a map, not an index 
array offset. This gives us safety for future-proofing against unknown verbs.

Not a bad idea, but we should probably be a bit more conservative re: what we 
backport to 3.0, and especially 2.2 at this point. How about, instead, we just 
backport {{PING}} to 3.11 and 3.0, so in the upgrade scenario there will be no 
harm to connections?

So, TL;DR, maybe do this?
1. Make 4.0 robust against {{null}} verb and skip remainders of messages we 
can't parse. There is precedent for it as well, see 
2. Stop introducing new {{UNUSED_}} verbs starting with 4.0.
3. Backport {{PING}} to 3.0 and 3.11, so upgraders from recent 3.0 and 3.11 
with the fix will have a smoother experience when going to 4.0.
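
Points 1 and 2 could be sketched roughly as below. This is a minimal illustration, not the actual MessageIn code: the class name, the verb ids, and the frame layout (int verb id, int payload size, payload bytes) are all assumptions made up for the example. It shows a map-based lookup that yields {{null}} for an unknown id, and a reader that skips the payload of such a message so the next message on the stream still parses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class UnknownVerbSketch {
    // Hypothetical verbs and ids, for illustration only.
    enum Verb {
        MUTATION(0), READ(1), PING(2);

        final int id;
        Verb(int id) { this.id = id; }

        private static final Map<Integer, Verb> BY_ID = new HashMap<>();
        static { for (Verb v : values()) BY_ID.put(v.id, v); }

        // Returns null for an unknown id, where an array offset lookup would throw.
        static Verb fromId(int id) { return BY_ID.get(id); }
    }

    // Assumed frame layout: int verb id, int payload size, payload bytes.
    // Returns the verb, or null after skipping the payload of an unknown verb.
    static Verb readMessage(DataInputStream in) throws IOException {
        int verbId = in.readInt();
        int payloadSize = in.readInt();
        Verb verb = Verb.fromId(verbId);
        if (verb == null) {
            int remaining = payloadSize;
            while (remaining > 0) {
                int skipped = in.skipBytes(remaining);
                if (skipped <= 0) throw new EOFException("stream ended inside payload");
                remaining -= skipped;
            }
            return null;
        }
        in.readFully(new byte[payloadSize]); // a real handler would deserialize here
        return verb;
    }

    static void writeFrame(DataOutputStream out, int verbId, byte[] payload) throws IOException {
        out.writeInt(verbId);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Writes an unknown verb (id 99) followed by PING, then reads both back.
    static Verb[] demo() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeFrame(out, 99, new byte[] {1, 2, 3, 4, 5});
            writeFrame(out, Verb.PING.id, new byte[0]);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            return new Verb[] { readMessage(in), readMessage(in) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In {{demo()}}, the first read returns {{null}} for the unknown id and the second still returns {{PING}}, because the five unknown payload bytes were consumed rather than left on the stream.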

> Add optional startup delay to wait until peers are ready
> --------------------------------------------------------
>                 Key: CASSANDRA-13993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>             Fix For: 4.0
> When bouncing a node in a large cluster, it can take a while to recognize the 
> rest of the cluster as available. This is especially true if using TLS on 
> internode messaging connections. The bouncing node (and any clients connected 
> to it) may see a series of Unavailable or Timeout exceptions until the node 
> is 'warmed up' as connecting to the rest of the cluster is asynchronous from 
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate 
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects 
> the unavailable exceptions.
> - having both outbound and inbound connections open and ready to each 
> peer. This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay 
> opening the client native protocol port until some percentage of the peers in 
> the cluster is marked alive and connected to/from. Thus, while we potentially 
> slow down startup (delay opening the client port), we reduce the chance 
> that queries made by clients hit transient unavailable/timeout 
> exceptions.
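
The gating mechanism described in the ticket might look something like the following sketch. It is not the patch itself: {{StartupGateSketch}}, {{PeerState}}, and the percent, timeout, and poll parameters are hypothetical names chosen for the example. The idea shown is to poll a snapshot of peer state and return once the required percentage of peers is both alive and connected, or give up at a deadline so startup is never blocked indefinitely.

```java
import java.util.Collection;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class StartupGateSketch {
    // Hypothetical snapshot of one peer: alive per gossip, and connections open both ways.
    static final class PeerState {
        final boolean alive;
        final boolean connected;
        PeerState(boolean alive, boolean connected) {
            this.alive = alive;
            this.connected = connected;
        }
    }

    // True when at least requiredPercent of peers are both alive and connected.
    // Integer arithmetic avoids floating point rounding at the threshold.
    static boolean enoughPeersReady(Collection<PeerState> peers, int requiredPercent) {
        if (peers.isEmpty()) return true; // single-node cluster: nothing to wait for
        long ready = peers.stream().filter(p -> p.alive && p.connected).count();
        return ready * 100 >= (long) requiredPercent * peers.size();
    }

    // Polls until enough peers are ready or the timeout elapses.
    // Returns true if the threshold was met, false on timeout or interruption.
    static boolean awaitPeers(Supplier<Collection<PeerState>> peers, int requiredPercent,
                              long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (enoughPeersReady(peers.get(), requiredPercent)) return true;
            if (System.nanoTime() - deadline >= 0) return false;
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A caller would open the client port after {{awaitPeers}} returns, regardless of the result, so that the delay stays optional and bounded.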

