[
https://issues.apache.org/jira/browse/CASSANDRA-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086025#comment-18086025
]
Cameron Zemek edited comment on CASSANDRA-21428 at 6/4/26 9:27 AM:
-------------------------------------------------------------------
Was on call with a customer that was hitting this issue. Deployed this patch
with TRACE logging enabled and can confirm that the cluster had this problem
and that the patch resolved it: with the patch you see the ECHO_REQ time out
and then get resent; without the patch the node was stuck indefinitely,
skipping the send of ECHO_REQ.
The patch was applied against 5.0.7.
> Nodes can become stuck in DOWN if ECHO_REQ has timeout
> ------------------------------------------------------
>
> Key: CASSANDRA-21428
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21428
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Cameron Zemek
> Priority: Normal
>
> In Gossiper, echoHandler only implements onResponse.
> RequestCallback.onFailure defaults to a no-op, so when the ECHO_REQ times out
> or the remote node returns an error, inflightEcho.remove(addr) is never
> called and the stale entry persists. Any subsequent
> markAlive(addr, localState) call, where localState is the same
> in-place-mutated object already in inflightEcho, sees
> localState.equals(prevState) == true (identity equality, same reference) and
> skips sending the echo, indefinitely.
>
> In a temporary-partition scenario (node briefly unreachable, echo times out,
> node recovers with the same generation), the node can get stuck permanently
> marked dead: the failure detector sees it as alive and keeps triggering
> markAlive, but every invocation is suppressed by the stale entry. The stale
> entry is only cleared by removeEndpoint() (explicit removal) or by
> silentlyMarkDead() via markDead() (failure detector conviction), and neither
> fires while the failure detector is reporting the node as healthy.
>
> Fix: override onFailure in echoHandler to call inflightEcho.remove(addr),
> with invokeOnFailure returning true as in the diff below.
> {code:java}
> diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java b/src/java/org/apache/cassandra/gms/Gossiper.java
> index 647441ffba..6d046cd8ba 100644
> --- a/src/java/org/apache/cassandra/gms/Gossiper.java
> +++ b/src/java/org/apache/cassandra/gms/Gossiper.java
> @@ -65,6 +65,7 @@ import org.apache.cassandra.config.CassandraRelevantProperties;
>  import org.apache.cassandra.config.DatabaseDescriptor;
>  import org.apache.cassandra.db.SystemKeyspace;
>  import org.apache.cassandra.dht.Token;
> +import org.apache.cassandra.exceptions.RequestFailureReason;
>  import org.apache.cassandra.locator.InetAddressAndPort;
>  import org.apache.cassandra.net.Message;
>  import org.apache.cassandra.net.MessagingService;
> @@ -1442,13 +1443,30 @@ public class Gossiper implements IFailureDetectionEventListener, GossiperMBean,
>          {
>              Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
>              logger.trace("Sending ECHO_REQ to {}", addr);
> -            RequestCallback echoHandler = msg ->
> +            RequestCallback echoHandler = new RequestCallback()
>              {
> -                runInGossipStageBlocking(() -> {
> -                    EndpointState epState = inflightEcho.remove(addr);
> -                    if (epState != null)
> -                        realMarkAlive(addr, epState);
> -                });
> +                @Override
> +                public void onResponse(Message msg)
> +                {
> +                    runInGossipStageBlocking(() -> {
> +                        EndpointState epState = inflightEcho.remove(addr);
> +                        if (epState != null)
> +                            realMarkAlive(addr, epState);
> +                    });
> +                }
> +
> +                @Override
> +                public boolean invokeOnFailure()
> +                {
> +                    return true;
> +                }
> +
> +                @Override
> +                public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
> +                {
> +                    logger.trace("ECHO_REQ to {} failed ({})", addr, failureReason);
> +                    inflightEcho.remove(addr);
> +                }
>              };
>              MessagingService.instance().sendWithCallback(echoMessage, addr, echoHandler);
>          }
> {code}
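
The stale-entry behaviour described in the issue can be reproduced outside
Cassandra with a small standalone model. The sketch below is not the Gossiper
code: StaleEchoSketch, EchoCallback, State, timeOutPendingEcho,
answerPendingEcho and run are hypothetical names made up for illustration, the
address is a placeholder, and the gossip stage, messaging layer and
invokeOnFailure() are not modelled. It only assumes what the description
states: a map of in-flight echoes keyed by address, an identity-equality check
in markAlive, and a callback whose onFailure is a no-op unless overridden.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StaleEchoSketch
{
    /** Hypothetical stand-in for RequestCallback: onFailure defaults to a no-op. */
    interface EchoCallback
    {
        void onResponse();
        default void onFailure() {}
    }

    /** Hypothetical stand-in for EndpointState; inherits identity-based equals(). */
    static final class State {}

    private final Map<String, State> inflightEcho = new ConcurrentHashMap<>();
    private final boolean clearOnFailure; // true models the patched behaviour
    private EchoCallback pending;         // callback of the last echo sent, if unanswered
    int echoesSent;
    boolean alive;

    StaleEchoSketch(boolean clearOnFailure)
    {
        this.clearOnFailure = clearOnFailure;
    }

    void markAlive(String addr, State localState)
    {
        // Same object already recorded as in flight: skip sending another echo.
        if (localState.equals(inflightEcho.get(addr)))
            return;

        inflightEcho.put(addr, localState);
        echoesSent++;
        pending = new EchoCallback()
        {
            @Override
            public void onResponse()
            {
                if (inflightEcho.remove(addr) != null)
                    alive = true;
            }

            @Override
            public void onFailure()
            {
                // Without this removal the entry stays behind after a timeout.
                if (clearOnFailure)
                    inflightEcho.remove(addr);
            }
        };
    }

    /** Simulates the pending echo timing out. */
    void timeOutPendingEcho()
    {
        EchoCallback cb = pending;
        pending = null;
        if (cb != null)
            cb.onFailure();
    }

    /** Simulates the remote node answering the pending echo, if one exists. */
    void answerPendingEcho()
    {
        EchoCallback cb = pending;
        pending = null;
        if (cb != null)
            cb.onResponse();
    }

    /** Temporary partition: echo sent, echo times out, node recovers with the same state. */
    static StaleEchoSketch run(boolean clearOnFailure)
    {
        StaleEchoSketch gossiper = new StaleEchoSketch(clearOnFailure);
        State state = new State();
        gossiper.markAlive("10.0.0.2", state); // first echo is sent
        gossiper.timeOutPendingEcho();          // node briefly unreachable
        gossiper.markAlive("10.0.0.2", state); // failure detector reports it healthy again
        gossiper.answerPendingEcho();           // node replies, if an echo was resent
        return gossiper;
    }

    public static void main(String[] args)
    {
        for (boolean fix : new boolean[]{ false, true })
        {
            StaleEchoSketch g = run(fix);
            System.out.println("clearOnFailure=" + fix + " echoesSent=" + g.echoesSent + " alive=" + g.alive);
        }
    }
}
```

With clearOnFailure=false the second markAlive is skipped, so the model ends
with echoesSent=1 and alive=false. With clearOnFailure=true the echo is resent
and answered, giving echoesSent=2 and alive=true, which matches the
timeout-then-resend pattern seen in the TRACE logs.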