[
https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334119#comment-14334119
]
Jonathan Ellis commented on CASSANDRA-5665:
-------------------------------------------
(1.5 yr later) Should we close this?
> Gossiper.handleMajorStateChange can lose existing node ApplicationState
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-5665
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5665
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.2.5
> Reporter: Jason Brown
> Assignee: Jason Brown
> Priority: Minor
> Labels: gossip, upgrade
> Attachments: 5665-v1.diff, 5665-v2.diff
>
>
> Dovetailing on CASSANDRA-5660, I discovered that further along during an
> upgrade, when more nodes are on the new major version, a node the previous
> version can get passed some incomplete Gossip info about another, already
> upgraded node, and the older node drops AppStat info about that node.
> I think what happens is that a 1.1 node (older rev) gets gossip info from a
> 1.2 node (A), which includes incomplete (lacking some AppState data) gossip
> info about another 1.2 node (B). The 1.1 node, which has marked incorrectly
> kicked node B out of gossip due to the bug described in #5660, then takes
> that incomplete node B info and wholesale replaces any previous known state
> about node B in Gossiper.handleMajorStateChanged. Thus, if we previously had
> DC/RACK info, it'll get dropped as part of the
> endpointStateMap.put(endpointstate). When the data being pased is incomplete,
> 1.1 will start referencing node B and gets into the NPE situation in #5498.
> Anecdotally, this bad state is short-lived, less than a few minutes, even as
> short as ten seconds, until gossip catches up and properly propagates the
> AppState data. Furthermore, when upgrading a two datacenter, 48 node cluster,
> it only occurred on two nodes for less than a minute each. Thus, the scope
> seems limited but can occur.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)