[ 
https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096514#comment-15096514
 ] 

Joel Knighton commented on CASSANDRA-10969:
-------------------------------------------

A node shows in nodetool status if it is part of the set of "live nodes". A 
node is marked "live" when we first add some of its gossip state locally, 
provided it can be reached with an echo message.

I suspect what happened is that N2 gossiped with N3 or N4, first applied N3 or 
N4's (outdated) gossip information about N1, sent an echo message to N1 to 
check if it was alive, received a reply, and marked N1 alive.

At this point, N1 would show as up in nodetool status. Then, since no new 
gossip deltas were applied for N1 (because of the generation gap), the failure 
detector marked N1 as down on N2.
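
To make the suspected sequence concrete, here is a toy model of N2's view of 
N1. It is only a sketch: the names (PeerView, on_first_state, apply_delta, 
failure_detector) are hypothetical and not taken from Gossiper.java, the 
one-year threshold is assumed to be 365 days in seconds, and a fixed timeout 
stands in for the real failure detector.

```python
from dataclasses import dataclass

# Assumed threshold: one 365-day year in seconds (hypothetical constant name).
ONE_YEAR_SECONDS = 86400 * 365


@dataclass
class PeerView:
    """What one node (N2) believes about a peer (N1). Hypothetical structure."""
    generation: int           # last generation accepted for the peer
    alive: bool = False       # whether the peer is in the "live" set
    last_update: float = 0.0  # time of the last accepted gossip state


def on_first_state(view: PeerView, echo_ok: bool, now: float) -> None:
    """Mark the peer live if it answered the echo after its state was first added."""
    if echo_ok:
        view.alive = True
        view.last_update = now


def apply_delta(view: PeerView, received_generation: int, now: float) -> bool:
    """Apply a gossip delta unless the generation is too far ahead of the local one."""
    if received_generation > view.generation + ONE_YEAR_SECONDS:
        return False  # rejected as an invalid generation; nothing is updated
    view.generation = max(view.generation, received_generation)
    view.last_update = now
    return True


def failure_detector(view: PeerView, now: float, timeout: float) -> None:
    """Simplified stand-in: mark the peer down after `timeout` without updates."""
    if view.alive and now - view.last_update > timeout:
        view.alive = False


# N2 learns outdated state about N1 from N3 or N4, then N1 answers the echo.
n1 = PeerView(generation=1414613355)
on_first_state(n1, echo_ok=True, now=0.0)
up_after_echo = n1.alive

# N1's real (restarted) generation is rejected, so no delta is ever applied.
accepted = apply_delta(n1, received_generation=1450978722, now=1.0)

# With no accepted updates, the failure detector eventually marks N1 down.
failure_detector(n1, now=30.0, timeout=10.0)
print(up_after_echo, accepted, n1.alive)
```

In this model N1 is briefly up after the echo reply and then goes down 
without any change on N1 itself, which matches the behavior described above.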

I could try to confirm this with N1 and N2's logs.

> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a 
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my 
> control) restarted.  The remaining nodes are logging errors like this:
>     "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local 
> generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the 
> one-year threshold added for CASSANDRA-8113.  The system clocks are 
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and 
> 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem, 
> whence the minor priority.
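
The gap quoted in the description can be checked with a little arithmetic. 
This is a sketch that assumes the "one-year threshold" means 365 days 
expressed in seconds and that generations are Unix timestamps in seconds; the 
constant and function names are hypothetical, not copied from Gossiper.java.

```python
# Generation numbers from the logged error message.
LOCAL_GENERATION = 1414613355
RECEIVED_GENERATION = 1450978722

SECONDS_PER_DAY = 86400
# Assumed threshold: one 365-day year in seconds (hypothetical name).
ONE_YEAR_SECONDS = SECONDS_PER_DAY * 365


def exceeds_threshold(local_gen: int, received_gen: int,
                      threshold: int = ONE_YEAR_SECONDS) -> bool:
    """Return True if the received generation is more than `threshold` ahead."""
    return received_gen > local_gen + threshold


gap = RECEIVED_GENERATION - LOCAL_GENERATION
print(f"gap = {gap} s, about {gap / SECONDS_PER_DAY:.1f} days")
print("exceeds one year:", exceeds_threshold(LOCAL_GENERATION, RECEIVED_GENERATION))
```

Under these assumptions the gap is 36365367 seconds, roughly 421 days, which 
is above the 31536000-second limit, consistent with the logged error.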



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
