[
https://issues.apache.org/jira/browse/CASSANDRA-16159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276451#comment-17276451
]
Jon Meredith commented on CASSANDRA-16159:
------------------------------------------
No, not yet. I diagnosed it from the logs when I had a cluster in this state.
It should be possible to reproduce in an in-jvm dtest by adding some barriers
in the message filtering to force a node into the bad state (a rough sketch
follows below).
While I think this is an issue worth fixing, since the gossip state will be
incorrect, it only affects users that are replacing multiple hosts at once AND
one of the replacements is also a seed, which is probably rare. So while I'd
appreciate any help getting it resolved, if there are tasks to close out 4.0
bugs/testing, I'd say those are higher priority, as this has been a
longstanding issue.
I'm going to adjust the fix versions to reflect that it also impacts the 3.0 &
probably 3.11 series. I don't think it should block release.
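For anyone picking this up, here is a minimal sketch of the filter-barrier
idea using the in-jvm dtest API. The verb choice (GOSSIP_DIGEST_ACK2) and the
latch placement are assumptions about where the race sits, and the actual
replacement-bootstrap setup (token supplier, replace_address properties, etc.)
is omitted:
{noformat}
import java.util.concurrent.CountDownLatch;

import org.apache.cassandra.distributed.Cluster;
import org.apache.cassandra.distributed.api.Feature;
import org.apache.cassandra.net.Verb;

public class GossipReplaceBarrierTest
{
    public void reproduce() throws Throwable
    {
        try (Cluster cluster = Cluster.build(3)
                                      .withConfig(c -> c.with(Feature.GOSSIP, Feature.NETWORK))
                                      .start())
        {
            CountDownLatch barrier = new CountDownLatch(1);

            // Hold GOSSIP_DIGEST_ACK2 messages headed to node 1 until the test
            // releases the barrier, so the replacement gossip states arrive in
            // the order that triggers the unknown-endpoint path in isAlive().
            cluster.filters()
                   .verbs(Verb.GOSSIP_DIGEST_ACK2.id)
                   .to(1)
                   .messagesMatching((from, to, msg) -> {
                       try { barrier.await(); }
                       catch (InterruptedException e) { throw new RuntimeException(e); }
                       return false; // delay delivery, don't drop
                   })
                   .drop();

            // ... start the concurrent host replacements here, then:
            barrier.countDown();
        }
    }
}
{noformat}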
> Reduce the Severity of Errors Reported in FailureDetector#isAlive()
> -------------------------------------------------------------------
>
> Key: CASSANDRA-16159
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16159
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Caleb Rackliffe
> Assignee: Jon Meredith
> Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> Noticed the following error in the failure detector during a host replacement:
> {noformat}
> java.lang.IllegalArgumentException: Unknown endpoint: 10.38.178.98:7000
>     at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
>     at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
>     at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
>     at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
>     at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
>     at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
>     at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
>     at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
>     at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> {noformat}
> This particular error looks benign, given that even if it occurs, the node
> continues to handle the {{BOOT_REPLACE}} state. There are two things we might
> be able to do to improve {{FailureDetector#isAlive()}} though:
> 1.) We don’t short-circuit in the case where the endpoint in question is in
> quarantine after being removed. It may be useful to check for this so we can
> avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine
> works great when the gossip message is _from_ a quarantined endpoint, but in
> this case, that would be the new/replacing node and not the old/replaced one.)
> 2.) We can reduce the severity of the logging from ERROR to WARN and provide
> better context for determining whether or not there’s actually a
> problem. (e.g. “If this occurs while trying to determine liveness for a node
> that is currently being replaced, it can be safely ignored.”) A sketch of
> both changes follows below.
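> As a starting point for both suggestions, here is a rough sketch of what the
> {{isAlive()}} path could look like. Note that {{isQuarantined()}} is a
> hypothetical accessor for the Gossiper's quarantine tracking (the
> justRemovedEndpoints map), not an existing public API:
> {noformat}
> // A sketch only: isQuarantined() is a hypothetical helper, and the exact
> // wording/placement would need review against the real FailureDetector.
> public boolean isAlive(InetAddressAndPort ep)
> {
>     if (ep.equals(FBUtilities.getBroadcastAddressAndPort()))
>         return true;
>
>     EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(ep);
>     if (epState == null)
>     {
>         // Suggestion 1: short-circuit for recently removed, quarantined
>         // endpoints instead of treating them as an error.
>         if (Gossiper.instance.isQuarantined(ep)) // hypothetical accessor
>             return false;
>
>         // Suggestion 2: log at WARN with context on when this is benign.
>         logger.warn("Unknown endpoint: {}. If this occurs while determining liveness " +
>                     "for a node that is currently being replaced, it can be safely ignored.", ep);
>     }
>     return epState != null && epState.isAlive();
> }
> {noformat}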
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]