[
https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087314#comment-15087314
]
T. David Hudson commented on CASSANDRA-10969:
---------------------------------------------
A single-pass rolling restart proved insufficient; there's probably an
additional problem with gossip in this area.
Nodes 2, 3, and 4 had been rejecting node 1's gossip generation. N2
was the first to be restarted. Nodetool status on N2 showed N1 up, at least
for a little while (until N3 got restarted?). Then nodetool status on N2
started reporting N1 down, and its log showed it rejecting N1's generation
based on an old generation, even though its system.local had a new generation.
Nodetool gossipinfo on N2 was reporting an old generation for N1. After N3
and N4 had been restarted, nodetool status commands on N2 and N3 were still
reporting N1 down, but N4 was reporting N1 up. Restarting N1 made no
difference. Restarting N2 and then N3 again was required for the cluster to
become fully up.
> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
> Key: CASSANDRA-10969
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Environment: 4-node Cassandra 2.1.1 cluster, each node running on a
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
> Reporter: T. David Hudson
> Assignee: Joel Knighton
> Priority: Minor
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my
> control) restarted. The remaining nodes are logging errors like this:
> "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local
> generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the
> one-year threshold added for CASSANDRA-8113. The system clocks are
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and
> 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem,
> whence the minor priority.
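
For reference, the arithmetic behind the quoted log line can be checked with a short sketch. It assumes that gossip generations are Unix timestamps in seconds and that the one-year threshold is 365 days expressed in seconds. The names generation_rejected and MAX_GENERATION_DIFFERENCE are illustrative here and are not copied from Gossiper.java.

```python
# Assumption: the one-year threshold is 365 days in seconds.
MAX_GENERATION_DIFFERENCE = 86400 * 365


def generation_rejected(local_generation, received_generation,
                        max_diff=MAX_GENERATION_DIFFERENCE):
    """Hypothetical check: reject a received generation that is more than
    max_diff seconds ahead of the generation stored locally for that peer."""
    return received_generation > local_generation + max_diff


# Values taken from the log message quoted in the issue description.
local_generation = 1414613355
received_generation = 1450978722

gap_seconds = received_generation - local_generation
gap_days = gap_seconds / 86400

print(f"gap = {gap_seconds} s (about {gap_days:.1f} days)")
print("rejected:", generation_rejected(local_generation, received_generation))
```

Under these assumptions the gap is 36365367 seconds, roughly 421 days, so a peer that ran for more than a year before restarting would trip the check even with correct clocks on every node.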