[
https://issues.apache.org/jira/browse/CASSANDRA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152695#comment-15152695
]
Joel Knighton commented on CASSANDRA-10371:
-------------------------------------------
Thanks - those logs confirm my suspicion that 10.0.2.128 is propagating the
EndpointState through the cluster and not evicting it. One more piece of
information will allow me to root-cause this and suggest a fix.
If you connect to 10.0.2.128 over JMX, on
org.apache.cassandra.net.FailureDetector, there should be an operation
dumpInterArrivalTimes(). Invoking that operation over JMX will create a file in
the Java temporary directory (likely just "/tmp") called "failuredetector-{SOME
NUMBERS}.dat". If you could attach that file to this ticket, I can diagnose the
issue further. There is no sensitive information in that file; it will just
contain the samples of gossip arrival time for nodes in the cluster.
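
For anyone who prefers to script this rather than use jconsole, the invocation
described above can be sketched in Java as follows. This is a minimal sketch
under stated assumptions, not a tested procedure for this cluster: the class
name DumpArrivalTimes and its helper are hypothetical, the port defaults to
7199 (Cassandra's default JMX port), JMX authentication and SSL are assumed to
be disabled, and the MBean is assumed to be registered as
org.apache.cassandra.net:type=FailureDetector. If JMX on the node only listens
on localhost, run it on the node itself.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpArrivalTimes {
    // Assumed JMX object name of the failure detector MBean.
    public static final String MBEAN = "org.apache.cassandra.net:type=FailureDetector";

    // Builds the standard RMI-based JMX service URL for a host and port.
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Connects to the node and invokes the no-argument dumpInterArrivalTimes
    // operation. The dump file is written on the node, not on the client.
    public static void dump(String host, int port) throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            mbs.invoke(new ObjectName(MBEAN), "dumpInterArrivalTimes",
                       new Object[0], new String[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java DumpArrivalTimes <host> [port]");
            return;
        }
        // Port 7199 is assumed when none is given.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7199;
        dump(args[0], port);
        System.out.println("Invoked dumpInterArrivalTimes on " + args[0] + ":" + port);
    }
}
```

Running "java DumpArrivalTimes 10.0.2.128" would then attempt the call against
the node in question; the resulting .dat file should appear in that node's Java
temporary directory.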
Thanks again; your willingness to investigate on a running cluster with this
issue is tremendously helpful.
> Decommissioned nodes can remain in gossip
> -----------------------------------------
>
> Key: CASSANDRA-10371
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10371
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata
> Reporter: Brandon Williams
> Assignee: Joel Knighton
> Priority: Minor
>
> This may apply to other dead states as well. Dead states should be expired
> after 3 days. In the case of decom we attach a timestamp to let the other
> nodes know when it should be expired. It has been observed that sometimes a
> subset of nodes in the cluster never expires the state. Heap analysis of
> these nodes reveals that the epstate.isAlive check returns true when it
> should return false; a false result would allow the state to be evicted.
> This may have been affected by CASSANDRA-8336.