[ 
https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428361#comment-16428361
 ] 

Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Responding to @aleksey's comments out of order, but hopefully makes sense at 
the end.

bq. we should handle ordinals that are outside of our known range robustly 
instead.

Yeah, I think this is where we should get to.

In {{MessageIn#read()}}, we read the verb id from the stream, and then fetch 
the {{Verb}} instance. In pre-4.0, we literally index into the {{Verb[]}} in 
{{MessagingService}}, so any unknown {{Verb}} s would blow up there with an 
ArrayIndexOutOfBoundsException. With CASSANDRA-13283, committed on trunk, we 
are more intelligently resistant to unknown {{Verb}} s, and would just get a 
null {{Verb}}. Unfortunately, trunk would still have problems with an unknown 
{{Verb}} as it would not know how to deserialize the message (pre-4.0, of 
course, suffers the same problem). It justs reads the basic header data, and 
passes it down, where [it would be dropped by 
{{MessageDeliveryTask}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageDeliveryTask.java#L66].
 Unfortunately, if the message had more bytes in the stream which we didn't try 
to deseriliaze, trying to read the next message on the connection would fail 
spectacularly.

It's easy enough to avoid that, though, as we [already know the 
{{payloadSize}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageIn.java#L146],
 so we can easily skip over the payload, and leave the incoming stream in a 
clean state after we account for the unknown message. Note: {{payloadSize}} is 
required by the internode messaging protocol, so we are sure to have the 
payload size. Thus, we can just safely skip the stream forward when we don't 
know how to deserialize the message, send it forward, and just discard it at 
{{MessagedeliveryTask}}.

bq. So I was thinking about a major upgrade bounce scenario. Think the first 
ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to 
every node, but receive no pongs, correct? So every node until a threshold will 
have a significantly longer bounce. Do we care about this case?

As the {{PingMessage}} contains a one-byte payload, it would leave the stream 
in a bad (unconsumed) state. This is a bug for the upgrade scenario. It's not a 
terrible bug, but it will cause the connection that we tried to eagerly create 
(to the un-upgraded peer) to be thrown away as it will fail on the next 
succeeding message on the connection. See proposal at the end.

bq. As implemented currently, we are going to send PINGs potentially to 
3.11/3.0 - unless we switch to gating by version, which we do sometimes.

So here's the rub: we don't necessarily know the peer's version yet. The ping 
messages are sent on the large/small connections, but we're not guaranteed that 
at least one round of gossip has completed wherein we would learn the version 
of the peers (we're still at in the startup process). The un-upgraded node 
won't know how to respond to the the unkown {{Verb}}, which is acceptable, but 
we shouldn't leave the stream on that connection in a broken state (see above).

Proposal: 
- backport part of CASSANDRA-13283 to get the {{Verb}} from a map, not an index 
array offset. This gives us safety for future-proofing against unknown verbs.  
- if we do get an unknown verb id, skip the payload bytes in {{MessageIn}}. 
This leaves the input stream clean to process future messages.
- Further, I think we can eliminate the whole {{UNUSED_}} verbs thing as that 
was an incomplete defense against unknown verbs, and it didn't account for 
message payload.

I think if we backport this to at least 3.0 (maybe 2.2?) that should be 
sufficient for future-proofing against unknown messages. 

If this sounds reasonable, I'll open a separate ticket for that work.

> Add optional startup delay to wait until peers are ready
> --------------------------------------------------------
>
>                 Key: CASSANDRA-13993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>             Fix For: 4.0
>
>
> When bouncing a node in a large cluster, is can take a while to recognize the 
> rest of the cluster as available. This is especially true if using TLS on 
> internode messaging connections. The bouncing node (and any clients connected 
> to it) may see a series of Unavailable or Timeout exceptions until the node 
> is 'warmed up' as connecting to the rest of the cluster is asynchronous from 
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate 
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects 
> the unavailable exceptions
> - having both open outbound and inbound connections open and ready to each 
> peer. This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding a mechanism, optional and configurable, to delay 
> opening the client native protocol port until some percentage of the peers in 
> the cluster is marked alive and connected to/from. Thus while we potentially 
> slow down startup (delay opening the client port), we alleviate the chance 
> that queries made by clients don't hit transient unavailable/timeout 
> exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to