[ https://issues.apache.org/jira/browse/CASSANDRA-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428361#comment-16428361 ]
Jason Brown commented on CASSANDRA-13993:
-----------------------------------------

Responding to @aleksey's comments out of order, but hopefully it makes sense at the end.

bq. we should handle ordinals that are outside of our known range robustly instead.

Yeah, I think this is where we should get to. In {{MessageIn#read()}}, we read the verb id from the stream and then fetch the {{Verb}} instance. Pre-4.0, we literally index into the {{Verb[]}} in {{MessagingService}}, so any unknown {{Verb}} would blow up there with an {{ArrayIndexOutOfBoundsException}}. With CASSANDRA-13283, committed on trunk, we are more intelligently resistant to unknown {{Verb}}s and would just get a null {{Verb}}. Unfortunately, trunk would still have problems with an unknown {{Verb}}, as it would not know how to deserialize the message (pre-4.0, of course, suffers the same problem). It just reads the basic header data and passes it down, where [it would be dropped by {{MessageDeliveryTask}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageDeliveryTask.java#L66]. Worse, if the message had more bytes in the stream which we didn't try to deserialize, trying to read the next message on the connection would fail spectacularly. It's easy enough to avoid that, though, as we [already know the {{payloadSize}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessageIn.java#L146], so we can skip over the payload and leave the incoming stream in a clean state after we account for the unknown message. Note: {{payloadSize}} is required by the internode messaging protocol, so we are guaranteed to have it. Thus, when we don't know how to deserialize a message, we can safely skip the stream forward, pass the message along, and discard it at {{MessageDeliveryTask}}.

bq. So I was thinking about a major upgrade bounce scenario. Think the first ever node to upgrade to 4.0 in a cluster of 3.0 nodes - will send out pings to every node, but receive no pongs, correct? So every node until a threshold will have a significantly longer bounce. Do we care about this case?

As the {{PingMessage}} contains a one-byte payload, it would leave the stream in a bad (unconsumed) state. This is a bug for the upgrade scenario. It's not a terrible bug, but it will cause the connection that we tried to eagerly create (to the un-upgraded peer) to be thrown away, as it will fail on the next message sent on that connection. See the proposal at the end.

bq. As implemented currently, we are going to send PINGs potentially to 3.11/3.0 - unless we switch to gating by version, which we do sometimes.

So here's the rub: we don't necessarily know the peer's version yet. The ping messages are sent on the large/small connections, but we're not guaranteed that at least one round of gossip has completed wherein we would learn the versions of the peers (we're still early in the startup process). The un-upgraded node won't know how to respond to the unknown {{Verb}}, which is acceptable, but we shouldn't leave the stream on that connection in a broken state (see above).

Proposal:
- Backport part of CASSANDRA-13283 to get the {{Verb}} from a map, not an array index. This gives us safety for future-proofing against unknown verbs.
- If we do get an unknown verb id, skip the payload bytes in {{MessageIn}}. This leaves the input stream clean to process future messages.
- Further, I think we can eliminate the whole {{UNUSED_}} verbs thing, as that was an incomplete defense against unknown verbs and it didn't account for the message payload.

I think if we backport this to at least 3.0 (maybe 2.2?) that should be sufficient for future-proofing against unknown messages. If this sounds reasonable, I'll open a separate ticket for that work.
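To make the proposal concrete, here is a minimal, self-contained sketch of the map-based verb lookup plus payload skipping. This is not the actual {{MessageIn}} code; the class, {{idToVerb}} map, and {{readMessage}} method are illustrative stand-ins, and the "verb id + payload size" framing is simplified from the real internode header:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class UnknownVerbDemo
{
    // Stand-in for MessagingService.Verb; just two entries for the demo.
    enum Verb { MUTATION, READ }

    // Map-based lookup (the CASSANDRA-13283 approach) instead of indexing a
    // Verb[]: an unknown id yields null rather than an
    // ArrayIndexOutOfBoundsException.
    static final Map<Integer, Verb> idToVerb = new HashMap<>();
    static
    {
        idToVerb.put(0, Verb.MUTATION);
        idToVerb.put(1, Verb.READ);
    }

    /**
     * Reads one (simplified) message; if the verb id is unknown, skips
     * payloadSize bytes so the stream is left positioned at the next message.
     * Returns the Verb, or null for an unknown (skipped) message.
     */
    static Verb readMessage(DataInputStream in) throws IOException
    {
        int verbId = in.readInt();
        int payloadSize = in.readInt(); // always present per the internode protocol
        Verb verb = idToVerb.get(verbId);
        if (verb == null)
        {
            // Unknown verb: we cannot deserialize the payload, but we do know
            // its length, so consume it and leave the stream clean.
            in.skipBytes(payloadSize);
            return null;
        }
        // Real code would deserialize the payload here; the demo just skips it.
        in.skipBytes(payloadSize);
        return verb;
    }
}
```

The key property is that after an unknown verb the stream is byte-aligned on the next message, so a 3.0 node receiving a 4.0-only message (like {{PingMessage}}) would drop it without poisoning the connection.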
> Add optional startup delay to wait until peers are ready
> ---------------------------------------------------------
>
> Key: CASSANDRA-13993
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13993
> Project: Cassandra
> Issue Type: Improvement
> Components: Lifecycle
> Reporter: Jason Brown
> Assignee: Jason Brown
> Priority: Minor
> Fix For: 4.0
>
> When bouncing a node in a large cluster, it can take a while to recognize the
> rest of the cluster as available. This is especially true if using TLS on
> internode messaging connections. The bouncing node (and any clients connected
> to it) may see a series of Unavailable or Timeout exceptions until the node
> is 'warmed up', as connecting to the rest of the cluster is asynchronous from
> the rest of the startup process.
> There are two aspects that drive a node's ability to successfully communicate
> with a peer after a bounce:
> - marking the peer as 'alive' (state that is held in gossip). This affects
> the unavailable exceptions.
> - having both outbound and inbound connections open and ready to each peer.
> This affects timeouts.
> Details of each of these mechanisms are described in the comments below.
> This ticket proposes adding an optional, configurable mechanism to delay
> opening the client native protocol port until some percentage of the peers in
> the cluster is marked alive and connected to/from. Thus, while we potentially
> slow down startup (by delaying the opening of the client port), we reduce the
> chance that queries made by clients hit transient unavailable/timeout
> exceptions.
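The gating mechanism the ticket describes can be sketched as a simple threshold wait. This is a toy illustration, not the actual implementation: the {{StartupGate}} class, {{waitForPeers}} method, {{isReady}} predicate, and 100ms polling interval are all assumptions for the sake of the example (real code would more likely react to gossip callbacks than poll):

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class StartupGate
{
    /**
     * Blocks until at least requiredRatio of peers satisfy isReady (e.g. alive
     * in gossip with inbound/outbound connections established), or until
     * timeoutMillis elapses. Returns true if the threshold was met in time;
     * the caller would then open the native protocol port either way.
     */
    static <P> boolean waitForPeers(Set<P> peers, Predicate<P> isReady,
                                    double requiredRatio, long timeoutMillis)
        throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true)
        {
            long ready = peers.stream().filter(isReady).count();
            if (peers.isEmpty() || (double) ready / peers.size() >= requiredRatio)
                return true;
            if (System.currentTimeMillis() >= deadline)
                return false; // give up waiting; don't block startup forever
            TimeUnit.MILLISECONDS.sleep(100); // poll; real code might use gossip callbacks
        }
    }
}
```

The timeout is the important design point: the delay must be bounded so a node bouncing into a partially down cluster still eventually opens its client port.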