[ https://issues.apache.org/jira/browse/CASSANDRA-16159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271613#comment-17271613 ]

Ekaterina Dimitrova commented on CASSANDRA-16159:
-------------------------------------------------
Hey [~jmeredithco], [~maedhroz], I was wondering whether you guys are still
working on this one. Let me know if you won't have the time, and I can help.

> Reduce the Severity of Errors Reported in FailureDetector#isAlive()
> -------------------------------------------------------------------
>
> Key: CASSANDRA-16159
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16159
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Caleb Rackliffe
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 4.0-beta
>
>
> Noticed the following error in the failure detector during a host replacement:
> {noformat}
> java.lang.IllegalArgumentException: Unknown endpoint: 10.38.178.98:7000
>     at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
>     at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
>     at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
>     at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
>     at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
>     at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
>     at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
>     at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
>     at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> {noformat}
> This particular error looks benign, given that even if it occurs, the node
> continues to handle the {{BOOT_REPLACE}} state. There are two things we might
> be able to do to improve {{FailureDetector#isAlive()}} though:
> 1.) We don’t short circuit in the case that the endpoint in question is in
> quarantine after being removed. It may be useful to check for this so we can
> avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine
> works great when the gossip message is _from_ a quarantined endpoint, but in
> this case, that would be the new/replacing and not the old/replaced one.)
> 2.) We can reduce the severity of the logging from ERROR to WARN and provide
> better context around how to determine whether or not there’s actually a
> problem. (ex. “If this occurs while trying to determine liveness for a node
> that is currently being replaced, it can be safely ignored.”)
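
For illustration only, the two suggestions in the description could be combined roughly as in the sketch below. This is not the Cassandra implementation: the class IsAliveSketch, its fields, and the helpers addEndpoint and quarantine are hypothetical stand-ins, endpoints are plain strings, log output is captured in a list instead of going through a real logger, and returning false for an unknown endpoint is an assumption made here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a failure detector; NOT Cassandra's actual code.
final class IsAliveSketch {
    // Endpoints with known gossip state, mapped to whether they are alive.
    private final Map<String, Boolean> knownEndpoints = new HashMap<>();
    // Endpoints that were removed and are still quarantined (assumed bookkeeping).
    private final Set<String> quarantined = new HashSet<>();
    // Captured log lines, so the sketch does not depend on a logging library.
    final List<String> log = new ArrayList<>();

    void addEndpoint(String endpoint, boolean alive) {
        knownEndpoints.put(endpoint, alive);
    }

    // Drop the endpoint's state and remember that it was recently removed.
    void quarantine(String endpoint) {
        knownEndpoints.remove(endpoint);
        quarantined.add(endpoint);
    }

    boolean isAlive(String endpoint) {
        Boolean alive = knownEndpoints.get(endpoint);
        if (alive != null) {
            return alive;
        }
        // Suggestion 1: a quarantined endpoint is known to be gone, so
        // short-circuit without logging anything.
        if (quarantined.contains(endpoint)) {
            return false;
        }
        // Suggestion 2: log at WARN rather than ERROR and add context.
        log.add("WARN Unknown endpoint: " + endpoint
                + ". If this occurs while trying to determine liveness for a node"
                + " that is currently being replaced, it can be safely ignored.");
        return false; // assumption: unknown endpoints are treated as not alive
    }

    public static void main(String[] args) {
        IsAliveSketch fd = new IsAliveSketch();
        fd.addEndpoint("10.0.0.1:7000", true);   // made-up live peer
        fd.quarantine("10.38.178.98:7000");      // the replaced node from the trace
        System.out.println(fd.isAlive("10.0.0.1:7000"));
        System.out.println(fd.isAlive("10.38.178.98:7000"));
        System.out.println(fd.isAlive("10.0.0.2:7000")); // never seen: logs a WARN
        fd.log.forEach(System.out::println);
    }
}
```

In this sketch the quarantine check runs before any logging, so the old/replaced endpoint produces no log line at all, while a genuinely unknown endpoint still leaves a WARN behind for operators to investigate.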