[
https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094418#comment-15094418
]
Joel Knighton commented on CASSANDRA-10969:
-------------------------------------------
Sorry - I missed your reply.
I suspect the issue was that on restart, N2 first gossiped with N3 or N4, which
still held the old generation. That would have re-contaminated N2 and put it
back in the same state as before.
If N4 had restarted and first gossiped with N1, it would have received the new
generation. The odds would then be much better for N2 or N3 to gossip with a
node holding the correct generation when they restart.
It now seems clear that rolling restarts will eventually solve the issue,
depending on which node gossip first occurs with on restart, but a single
rolling restart may not be sufficient. My apologies if my initial advice caused
any pain.
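For reference, the log line in the report comes from the sanity check added in
CASSANDRA-8113, which drops any remote generation more than roughly a year ahead
of the locally recorded one; that is why the stale nodes keep rejecting the
restarted node's new generation. A minimal standalone sketch of that check
(names and structure are simplified for illustration, not the actual
Gossiper.java code):

{code:java}
// Simplified sketch of the generation sanity check from CASSANDRA-8113.
// Names are illustrative; the real logic lives in org.apache.cassandra.gms.Gossiper.
public class GenerationCheckSketch
{
    // Reject remote generations more than ~1 year ahead of the local view.
    static final long MAX_GENERATION_DIFFERENCE = 86400L * 365;

    static boolean isGenerationAcceptable(long localGeneration, long remoteGeneration)
    {
        return remoteGeneration <= localGeneration + MAX_GENERATION_DIFFERENCE;
    }

    public static void main(String[] args)
    {
        // Values from the report: the locally recorded (stale) generation dates
        // to 2014, while the restarted node advertises a 2015 generation, so the
        // gap exceeds the one-year threshold and the update is dropped.
        long localGeneration = 1414613355L;
        long remoteGeneration = 1450978722L;

        if (!isGenerationAcceptable(localGeneration, remoteGeneration))
            System.out.printf("received an invalid gossip generation for peer; " +
                              "local generation = %d, received generation = %d%n",
                              localGeneration, remoteGeneration);
    }
}
{code}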
The planned patch will remove the need for a rolling restart in the first
place, solving the issue. I'm testing it now.
Thanks for the detailed reports; they make debugging the issue much easier.
> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
> Key: CASSANDRA-10969
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Environment: 4-node Cassandra 2.1.1 cluster, each node running on a
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
> Reporter: T. David Hudson
> Assignee: Joel Knighton
> Priority: Minor
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my
> control) restarted. The remaining nodes are logging errors like this:
> "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local
> generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the
> one-year threshold added for CASSANDRA-8113. The system clocks are
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and
> 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem,
> whence the minor priority.