[jira] [Commented] (CASSANDRA-16518) Node restart during joining sets protocol version to V3

Stefan Miklosovic (Jira) Mon, 07 Mar 2022 12:50:07 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502550#comment-17502550
 ]


Stefan Miklosovic commented on CASSANDRA-16518:
-----------------------------------------------

I think the ultimate explanation for why this happens is as follows (maybe 
handy for future readers / for reference):

In Gossiper#applyStateLocally, it gets to:
{code:java}
for (IEndpointStateChangeSubscriber subscriber : subscribers)
    subscriber.onJoin(ep, epState);
{code}
One of the subscribers is ReconnectableSnitchHelper, which will eventually call 
"SystemKeyspace.updatePreferredIP", which writes only the peer and preferred ip. 
So for a joining node, it inserts a row with only peer and preferred ip set, 
hence all other fields are null.

Then, we kind of assume that the node will join, everything gets streamed, etc. 
Once it is fully joined, all other fields are populated, so when another 
subscriber is triggered and its onJoin is called, the capping logic (where it 
currently fails) runs and follows the happy path, with the version populated 
and so on.

However, if we stop a node while some other node is joining the cluster, then 
once we restart it, we hit this bug: only the first subscriber has managed to 
be called, so the other subscriber with the capping logic sees nulls because 
nothing has updated them, since we stopped the node while the other node was 
still joining.

Basically, we introduced nulls and never managed to fill the columns with 
non-nulls before the node crashed, so the restarted node sees the nulls again 
and the capping logic goes south because it expects everything to be fully 
populated, which didn't happen due to the node's failure.
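
To make the sequence above concrete, here is a minimal, self-contained sketch 
of the failure mode. It is not Cassandra code: the class PeersNullSketch, the 
in-memory map standing in for the peers table, the helper names and the "3.0" 
threshold are all assumptions made for illustration. The only things taken from 
the ticket are that the first write sets just the preferred ip, and that a null 
version fails the version check and caps the protocol to V3.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the null-version problem; hypothetical names, not Cassandra's API. */
class PeersNullSketch {
    /** Stand-in for the peers table: peer address -> (column name -> value). */
    public static final Map<String, Map<String, String>> peers = new HashMap<>();

    /** Mimics the first subscriber: writes only the preferred ip for the peer. */
    public static void updatePreferredIp(String peer, String preferredIp) {
        peers.computeIfAbsent(peer, k -> new HashMap<>()).put("preferred_ip", preferredIp);
    }

    /** Mimics the later write that happens once the peer has fully joined. */
    public static void updateReleaseVersion(String peer, String version) {
        peers.computeIfAbsent(peer, k -> new HashMap<>()).put("release_version", version);
    }

    /** Assumed capping rule: a missing or pre-3.0 version on any peer caps to V3. */
    public static String maxNegotiableVersion() {
        for (Map<String, String> row : peers.values()) {
            String version = row.get("release_version");
            if (version == null || majorOf(version) < 3)
                return "V3";
        }
        return "V4";
    }

    /** Extracts the major component of a dotted version string such as "3.11.6". */
    static int majorOf(String version) {
        return Integer.parseInt(version.split("\\.")[0]);
    }

    public static void main(String[] args) {
        // Joining peer: only the preferred ip has been written, the version is null.
        updatePreferredIp("10.0.0.2", "10.0.0.2");
        // A node restarted at this point reads the half-populated row.
        System.out.println(maxNegotiableVersion()); // prints V3

        // If the join had completed first, the version would have been filled in.
        updateReleaseVersion("10.0.0.2", "3.11.6");
        System.out.println(maxNegotiableVersion()); // prints V4
    }
}
```

In this toy model, the cap could be avoided either by never persisting a 
half-populated row or by having the check ignore rows with a null version. 
Which of those (if either) the actual patch does is not stated in this comment.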

> Node restart during joining sets protocol version to V3
> -------------------------------------------------------
>
>                 Key: CASSANDRA-16518
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16518
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Joseph Clay
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 3.11.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While joining nodes to a cluster, an old node crashed. The old node was 
> recovered however clients (datastax java) refused to connect to it.
> The driver error:
> {noformat}
> Detected added or restarted Cassandra host /<ip>:<port> but ignoring it since 
> it does not support the version V4 of the native protocol which is currently 
> in use.{noformat}
> In the recovered node cassandra logs:
> {noformat}
> INFO  o.a.c.transport.ConfiguredLimit Detected peers which do not fully 
> support protocol V4. Capping max negotiable version to V3{noformat}
> I confirmed that ALL the nodes in the cluster, joining or otherwise, were 
> apache-cassandra-3.11.6 so that error message was rather confusing.
>  Eventually after digging through the code we got to the bottom of the issue:
> https://issues.apache.org/jira/browse/CASSANDRA-15193 adds a check for node 
> version, which reverts the protocol version to V3 if any peer fails the 
> version check. Joining nodes have NULL for their version in the peers table, 
> which fails the version check.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

