[
https://issues.apache.org/jira/browse/CASSANDRA-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502550#comment-17502550
]
Stefan Miklosovic commented on CASSANDRA-16518:
-----------------------------------------------
I think the ultimate explanation of why this happens is as follows (it may be
handy for future readers or as a reference).
In Gossiper#applyStateLocally, it gets to:
{code:java}
for (IEndpointStateChangeSubscriber subscriber : subscribers)
    subscriber.onJoin(ep, epState);
{code}
One of the subscribers is ReconnectableSnitchHelper, which eventually calls
"SystemKeyspace.updatePreferredIP", which writes only the peer and preferred IP.
So for a joining node, it inserts only the peer and preferred IP, hence all
other fields are null.
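To make the partial row concrete, here is a minimal sketch. It is not Cassandra code: PeersTableSketch and its map-backed "table" are hypothetical stand-ins for system.peers, and only the column names preferred_ip and release_version mirror the real schema. It shows how an upsert that touches a single column leaves every other column reading as null until a later update fills it in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the system.peers table; not Cassandra code.
class PeersTableSketch {
    // peer address -> (column name -> value); columns never written read as null
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    // Mimics an upsert that touches only the preferred_ip column of a peer's row.
    void updatePreferredIp(String peer, String preferredIp) {
        rows.computeIfAbsent(peer, k -> new HashMap<>()).put("preferred_ip", preferredIp);
    }

    // Mimics the later update that happens once the peer's full state is known.
    void updateReleaseVersion(String peer, String version) {
        rows.computeIfAbsent(peer, k -> new HashMap<>()).put("release_version", version);
    }

    // Returns the column value, or null if the row or the column was never written.
    String read(String peer, String column) {
        Map<String, String> row = rows.get(peer);
        return row == null ? null : row.get(column);
    }
}
```

If the process stops between the two update calls, a reader of the row sees a peer with a preferred IP but a null release_version.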
Then we more or less assume that the node will join, everything gets streamed,
and so on. Once it is fully joined, all other fields are populated, so when
another subscriber's onJoin is triggered again, the capping logic (where it
currently fails) runs on the happy path, with the version populated and so
on.
However, if we stop the node while some other node is joining the cluster, we
hit this bug once we restart it: only the first subscriber has managed to be
called, so the other subscriber with the capping logic sees nulls because
nothing has updated them, as we stopped the node while the other node was
still joining.
Basically, we introduced nulls and never managed to fill the columns with
non-nulls before the node crashed, so the restarted node sees the nulls again
and the capping logic goes wrong because it expects everything to be fully
populated, which didn't happen due to the node's failure.
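The failure mode described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual ConfiguredLimit code: CappingSketch, its method names, and the "major version >= 3" threshold are all hypothetical. The strict variant treats a null release version like an old peer and caps to V3; the tolerant variant shows one possible way to skip unknown versions, which is not necessarily the fix adopted for this ticket.

```java
import java.util.List;

// Hypothetical sketch of a version-capping check; not the actual Cassandra code.
class CappingSketch {
    static final int V3 = 3;
    static final int V4 = 4;

    // Strict variant: a null release version is treated as "does not support V4".
    static int strictMaxVersion(List<String> peerReleaseVersions) {
        for (String v : peerReleaseVersions) {
            if (v == null || !supportsV4(v)) return V3;
        }
        return V4;
    }

    // Tolerant variant: unknown (null) versions are skipped instead of forcing a cap.
    static int tolerantMaxVersion(List<String> peerReleaseVersions) {
        for (String v : peerReleaseVersions) {
            if (v != null && !supportsV4(v)) return V3;
        }
        return V4;
    }

    // Placeholder assumption for this sketch: major version >= 3 fully supports V4.
    static boolean supportsV4(String releaseVersion) {
        int major = Integer.parseInt(releaseVersion.split("\\.")[0]);
        return major >= 3;
    }
}
```

With one fully populated 3.11.6 peer and one peer whose version is still null, the strict variant caps to V3 even though every node runs the same release, which matches the confusing log message in the report.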
> Node restart during joining sets protocol version to V3
> -------------------------------------------------------
>
> Key: CASSANDRA-16518
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16518
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Joseph Clay
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 3.11.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> While joining nodes to a cluster, an old node crashed. The old node was
> recovered however clients (datastax java) refused to connect to it.
> The driver error:
> {noformat}
> Detected added or restarted Cassandra host /<ip>:<port> but ignoring it since
> it does not support the version V4 of the native protocol which is currently
> in use.{noformat}
> In the recovered node cassandra logs:
> {noformat}
> INFO o.a.c.transport.ConfiguredLimit Detected peers which do not fully
> support protocol V4. Capping max negotiable version to V3{noformat}
> I confirmed that ALL the nodes in the cluster, joining or otherwise, were
> apache-cassandra-3.11.6 so that error message was rather confusing.
> Eventually after digging through the code we got to the bottom of the issue:
> https://issues.apache.org/jira/browse/CASSANDRA-15193 adds a check for node
> version, which reverts the protocol version to V3 if any peer fails the
> version check. Joining nodes have NULL for their version in the peers table,
> which fails the version check.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)