[
https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068186#comment-15068186
]
Didier commented on CASSANDRA-10371:
------------------------------------
Hi Stefania,
You are perfectly right! I had just fixed my issue when you wrote your answer. My
problem is that there are in fact a lot of nodes affected by this mess (not just
one: multi-DC Europe / US).
I have set up these entries in log4j-server.properties on one node:
{code}
log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
{code}
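As a side note, on newer Cassandra versions (2.1+, which use logback instead of log4j), the same TRACE levels can be set at runtime without editing files or restarting, via nodetool. This is a sketch for those versions only; it does not apply to the log4j-based setup above:
{code}
# Raise gossip logging to TRACE at runtime (Cassandra 2.1+ only).
# The change is not persisted across restarts.
nodetool setlogginglevel org.apache.cassandra.gms.GossipDigestSynVerbHandler TRACE
nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector TRACE
{code}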
With this trick I found the culprit nodes with a simple tail of the system.log:
{code}
tail -f system.log | grep "TRACE" | grep -A 10 -B 10 "192.168.136.28"
{code}
{code}
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java
(line 40) Received a GossipDigestSynMessage from /10.0.2.110
TRACE [GossipStage:1] 2015-12-22 14:25:10,262 GossipDigestSynVerbHandler.java
(line 71) Gossip syn digests are : /10.10.102.97:1448271725:7650177
/10.10.2.23:1450793863:1377 /10.0.102.190:1448275278:7636527
/10.0.2.36:1450792729:4816 /192.168.136.28:1449485228:258388
{code}
Every time I found a match with a phantom node IP in the gossip syn digests, I
ran this on the affected node (in this example 10.0.2.110):
{code}
nodetool drain && /etc/init.d/cassandra restart
{code}
After doing this on a number of nodes (15 nodes), I checked whether any entries
with the phantom nodes remained in my system.log ... and voila!
No more phantom nodes.
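For reference, the log-scanning step above can be sketched as a small shell helper. This is hypothetical (not part of Cassandra or the original procedure) and only assumes the TRACE line format shown above: given a phantom node IP and a system.log, it prints which peers are still advertising that IP in their gossip syn digests.
{code}
#!/bin/sh
# find_syn_senders <phantom-ip> <logfile>
# Hypothetical helper: list the senders of GossipDigestSynMessages whose
# digest line still mentions the phantom node IP.
find_syn_senders() {
  phantom="$1"
  log="$2"
  awk -v phantom="$phantom" '
    # Remember the sender from each "Received a GossipDigestSynMessage" line.
    /Received a GossipDigestSynMessage from/ {
      sender = $NF
      sub(/^\//, "", sender)   # strip the leading slash from /10.0.2.110
    }
    # When the digest line that follows mentions the phantom IP, report it.
    /Gossip syn digests/ && index($0, phantom) {
      print sender
    }
  ' "$log" | sort -u
}
{code}
Each IP it prints is a candidate for the nodetool drain + restart treatment described above.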
Thanks for your help ;)
Didier
> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
> Key: CASSANDRA-10371
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata
> Reporter: Brandon Williams
> Assignee: Stefania
> Priority: Minor
>
> This may apply to other dead states as well. Dead states should be expired
> after 3 days. In the case of decom we attach a timestamp to let the other
> nodes know when it should be expired. It has been observed that sometimes a
> subset of nodes in the cluster never expire the state, and through heap
> analysis of these nodes it is revealed that the epstate.isAlive check returns
> true when it should return false, which would allow the state to be evicted.
> This may have been affected by CASSANDRA-8336.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)