RE: Logs appear to contradict themselves during bootstrap steps

David Berry Fri, 06 Jan 2017 13:19:20 -0800

I’ve encountered this before: after a node is removed, its gossip info is 
retained for 72 hours, which prevents the IP from being reused during that 
period. You can check how long gossip will retain this information with 
“nodetool gossipinfo”; the expiry time is shown in epoch milliseconds as the 
last field of the STATUS line.


For example:

nodetool gossipinfo

/10.236.70.199
  generation:1482436691
  heartbeat:3942407
  STATUS:3942404:LEFT,3074457345618261000,1483995662276
  LOAD:3942267:3.60685807E8
  SCHEMA:223625:acbf0adb-1bbe-384a-acd7-6a46609497f1
  DC:20:orion
  RACK:22:r1
  RELEASE_VERSION:4:2.1.16
  RPC_ADDRESS:3:10.236.70.199
  SEVERITY:3942406:0.25094103813171387
  NET_VERSION:1:8
  HOST_ID:2:cd2a767f-3716-4717-9106-52f0380e6184
  TOKENS:15:<hidden>

Converting it from epoch milliseconds:

local@img2116saturn101:~$ date -d @$((1483995662276/1000))
Mon Jan  9 21:01:02 UTC 2017
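
The same conversion can be sketched in Python. This is a minimal sketch that 
assumes the STATUS layout shown in the output above 
(STATUS:<version>:<state>,<token>,<expiry in epoch milliseconds>); the helper 
name parse_left_status is hypothetical and not part of Cassandra or nodetool.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime, so the result is explicitly UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_left_status(line):
    """Return (state, expiry as UTC datetime) for a gossipinfo STATUS line.

    Assumption: the value has the form <state>,<token>,<expiry_ms>, as in
    the LEFT example above. Other states may use a different layout.
    """
    # Split off "STATUS" and the version number; keep the rest intact.
    _, _, value = line.strip().split(":", 2)
    fields = value.split(",")
    state = fields[0]
    expiry_ms = int(fields[-1])
    # Integer millisecond arithmetic avoids float rounding.
    return state, EPOCH + timedelta(milliseconds=expiry_ms)


state, expires_at = parse_left_status(
    "  STATUS:3942404:LEFT,3074457345618261000,1483995662276"
)
print(state, expires_at.isoformat())
```

This gives the same instant as the date command above, with the milliseconds 
retained. Comparing expires_at with datetime.now(timezone.utc) shows how long 
is left before the entry is dropped.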

At the time, we waited out the 72-hour period before reusing the IP; I have 
not used replace_address myself.


From: Sotirios Delimanolis [mailto:sotodel...@yahoo.com]
Sent: Friday, January 6, 2017 2:38 PM
To: User <user@cassandra.apache.org>
Subject: Logs appear to contradict themselves during bootstrap steps

We had a node go down in our cluster and its disk had to be wiped. During that 
time, all nodes in the cluster have restarted at least once.

We want to add the bad node back to the ring. It has the same IP/hostname. I 
followed the steps 
here<https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html>
 for "Adding nodes to an existing cluster."

When the process is started up, it reports

A node with address <hostname>/<address> already exists, cancelling join. Use 
cassandra.replace_address if you want to replace this node.

I found this error message in the StorageService using the Gossiper instance to 
look up the node's state. Apparently, the node knows about it. So I followed 
the instructions and added the cassandra.replace_address system property and 
restarted the process.

But it reports

Cannot replace_address /<address> because it doesn't exist in gossip

So which one is it? Does the ring know about it or not? Running "nodetool ring" 
does show it on all other nodes.

I've seen CASSANDRA-8138<https://issues.apache.org/jira/browse/CASSANDRA-8138> 
and the conditions are the same, but I can't understand why it thinks it's not 
part of gossip. What's the difference between the gossip check used to make 
this determination and the gossip check used for the first error message? Can 
someone explain?

I've since retrieved the node's id and used it to "nodetool removenode". After 
rebalancing, I added the node back and ran "nodetool cleanup". Everything's up 
and running, but I'd like to understand what Cassandra was doing.
