Caleb Rackliffe created CASSANDRA-16159:
-------------------------------------------

             Summary: Reduce the Severity of Errors Reported in FailureDetector#isAlive()
                 Key: CASSANDRA-16159
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16159
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Caleb Rackliffe


Noticed the following error in the failure detector during a host replacement:

{noformat}
java.lang.IllegalArgumentException: Unknown endpoint: 10.38.178.98:7000
        at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
        at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
        at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
        at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
        at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
        at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
        at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
        at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
        at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
{noformat}

This particular error looks benign, given that even if it occurs, the node 
continues to handle the {{BOOT_REPLACE}} state. There are two things we might 
be able to do to improve {{FailureDetector#isAlive()}} though:

1.) We don’t short circuit in the case that the endpoint in question is in 
quarantine after being removed. It may be useful to check for this so we can 
avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine 
works great when the gossip message is _from_ a quarantined endpoint, but in 
this case, the sender is the new/replacing endpoint, not the old/replaced one.)

2.) We can reduce the severity of the logging from ERROR to WARN and provide 
better context around how to determine whether or not there’s actually a 
problem. (e.g. “If this occurs while trying to determine liveness for a node 
that is currently being replaced, it can be safely ignored.”)
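
A minimal sketch of how the two ideas above might fit together. This is not Cassandra's actual {{FailureDetector}}: the class {{FailureDetectorSketch}}, its string-keyed maps, the {{quarantine()}} helper, and the warning counter are all hypothetical stand-ins, and it assumes an unknown endpoint is reported as not alive.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified, hypothetical stand-in for a failure detector; not Cassandra code.
final class FailureDetectorSketch {
    private static final Logger logger =
            Logger.getLogger(FailureDetectorSketch.class.getName());

    // Assumed state: endpoint -> liveness, plus endpoints quarantined after removal.
    private final Map<String, Boolean> endpointStates = new ConcurrentHashMap<>();
    private final Set<String> quarantined = ConcurrentHashMap.newKeySet();

    // Counts how often the "unknown endpoint" warning was emitted (for illustration).
    int unknownEndpointWarnings = 0;

    void markAlive(String endpoint, boolean alive) {
        endpointStates.put(endpoint, alive);
    }

    // Removing an endpoint drops its state and places it in quarantine.
    void quarantine(String endpoint) {
        endpointStates.remove(endpoint);
        quarantined.add(endpoint);
    }

    boolean isAlive(String endpoint) {
        // Idea 1: short-circuit for quarantined endpoints, which are known to be
        // gone, so nothing is logged for them.
        if (quarantined.contains(endpoint)) {
            return false;
        }

        Boolean alive = endpointStates.get(endpoint);
        if (alive == null) {
            // Idea 2: log at WARN rather than ERROR, with context for operators.
            unknownEndpointWarnings++;
            logger.log(Level.WARNING,
                    "Unknown endpoint: {0}. If this occurs while trying to determine "
                    + "liveness for a node that is currently being replaced, "
                    + "it can be safely ignored.",
                    endpoint);
            return false;
        }
        return alive;
    }

    public static void main(String[] args) {
        FailureDetectorSketch fd = new FailureDetectorSketch();
        fd.markAlive("10.0.0.1:7000", true);
        fd.quarantine("10.0.0.2:7000");

        System.out.println(fd.isAlive("10.0.0.1:7000")); // known, live endpoint
        System.out.println(fd.isAlive("10.0.0.2:7000")); // quarantined: no log line
        System.out.println(fd.isAlive("10.0.0.3:7000")); // unknown: one WARN line
    }
}
```

In this sketch the quarantine check runs before the state lookup, so a replaced node never reaches the logging path at all; only an endpoint that is neither known nor quarantined produces the warning.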



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
