[jira] [Commented] (CASSANDRA-10969) long-running cluster sees bad gossip generation when a node restarts

T. David Hudson (JIRA) Thu, 07 Jan 2016 04:53:24 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087314#comment-15087314 ]


T. David Hudson commented on CASSANDRA-10969:
---------------------------------------------

A single-pass rolling restart proved insufficient; there's probably an 
additional problem with gossip in this area.

Node 1's gossip generation had been rejected by nodes 2, 3, and 4.  N2 
was the first to be restarted.  Nodetool status on N2 showed N1 up, at least 
for a little while (until N3 got restarted?).  Then nodetool status on N2 
started reporting N1 down, and its log showed it rejecting N1's generation 
based on an old local generation, even though its system.local had a new 
generation.  Nodetool gossipinfo on N2 was reporting an old generation for N1.  
After N3 and N4 had been restarted, nodetool status on N2 and N3 still 
reported N1 down, but N4 reported N1 up.  Restarting N1 made no difference.  
The cluster became fully up only after N2 and then N3 were restarted a 
second time.


> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a 
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my 
> control) restarted.  The remaining nodes are logging errors like this:
>     "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the 
> one-year threshold added for CASSANDRA-8113.  The system clocks are 
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and 
> 3.0.x does not appear to have changed the behavior I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem, 
> hence the minor priority.
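The rejection in the quoted log line can be checked with simple arithmetic. Assuming gossip generations are Unix epoch seconds taken at node startup (the two logged values are consistent with that), the gap is 1450978722 - 1414613355 = 36365367 seconds, roughly 421 days. That exceeds a 365-day year of 31536000 seconds. The minimal Java sketch below models such a check. It assumes the rule is "reject if received > local + one year", and every name in it (GenerationCheck, MAX_GENERATION_DIFFERENCE, passesOneYearCheck) is hypothetical rather than taken from Gossiper.java.

```java
// Hypothetical sketch of a one-year gossip generation sanity check in the spirit
// of CASSANDRA-8113. The class, constant, and method names are illustrative
// assumptions, not the Gossiper.java API.
class GenerationCheck {
    // Assumption: the threshold is one 365-day year expressed in seconds.
    static final long MAX_GENERATION_DIFFERENCE = 365L * 24 * 60 * 60;

    // Returns true if the received generation is no more than one year ahead of
    // the locally known generation for the peer. A local generation of 0 is
    // treated as "no prior state", so there is nothing to compare against.
    static boolean passesOneYearCheck(long localGeneration, long receivedGeneration) {
        if (localGeneration == 0) {
            return true;
        }
        return receivedGeneration <= localGeneration + MAX_GENERATION_DIFFERENCE;
    }

    public static void main(String[] args) {
        // Values taken from the log message quoted in this report.
        long local = 1414613355L;
        long received = 1450978722L;
        long gap = received - local;
        System.out.println("gap = " + gap + " s, threshold = " + MAX_GENERATION_DIFFERENCE + " s");
        System.out.println("passes check: " + passesOneYearCheck(local, received));
    }
}
```

With the report's values this prints a gap of 36365367 s against a threshold of 31536000 s, followed by "passes check: false". Under this assumed rule, the comparison is always against whatever generation the receiving node currently holds for the peer. A node still holding a stale value for N1 would therefore keep rejecting N1's new generation. That is consistent with the nodetool gossipinfo observation in the comment.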



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
