Cameron Zemek created CASSANDRA-21428:
-----------------------------------------

             Summary: Nodes can become stuck in DOWN if ECHO_REQ has timeout
                 Key: CASSANDRA-21428
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21428
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Cameron Zemek


In Gossiper, echoHandler only implements onResponse. RequestCallback.onFailure 
has a default no-op, so when the ECHO_REQ times out or the remote node returns 
an error, inflightEcho.remove(addr) is never called. The stale entry persists. 
Any subsequent markAlive(addr, localState) call — where localState is the same 
in-place-mutated object already in inflightEcho — sees 
localState.equals(prevState) = true (identity equality, same reference) and 
skips indefinitely. In a temporary-partition scenario (node briefly 
unreachable, echo times out, node recovers with the same generation), the node 
can get stuck permanently dead: the failure detector sees it as alive and keeps 
triggering markAlive, but every invocation is suppressed by the stale entry. 
The stale entry is only cleared by removeEndpoint() (explicit removal) or 
silentlyMarkDead() via markDead() (failure detector conviction) — neither of 
which fires if the failure detector is reporting the node as healthy.

Fix: override onFailure in echoHandler to call inflightEcho.remove(addr).
{code:java}
diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java 
b/src/java/org/apache/cassandra/gms/Gossiper.java
index 647441ffba..6d046cd8ba 100644
--- a/src/java/org/apache/cassandra/gms/Gossiper.java
+++ b/src/java/org/apache/cassandra/gms/Gossiper.java
@@ -65,6 +65,7 @@ import 
org.apache.cassandra.config.CassandraRelevantProperties;
 import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.db.SystemKeyspace;
 import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.exceptions.RequestFailureReason;
 import org.apache.cassandra.locator.InetAddressAndPort;
 import org.apache.cassandra.net.Message;
 import org.apache.cassandra.net.MessagingService;
@@ -1442,13 +1443,30 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean,
         {
             Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
             logger.trace("Sending ECHO_REQ to {}", addr);
-            RequestCallback echoHandler = msg ->
+            RequestCallback echoHandler = new RequestCallback()
             {
-                runInGossipStageBlocking(() -> {
-                    EndpointState epState = inflightEcho.remove(addr);
-                    if (epState != null)
-                        realMarkAlive(addr, epState);
-                });
+                @Override
+                public void onResponse(Message msg)
+                {
+                    runInGossipStageBlocking(() -> {
+                        EndpointState epState = inflightEcho.remove(addr);
+                        if (epState != null)
+                            realMarkAlive(addr, epState);
+                    });
+                }
+
+                @Override
+                public boolean invokeOnFailure()
+                {
+                    return true;
+                }
+
+                @Override
+                public void onFailure(InetAddressAndPort from, 
RequestFailureReason failureReason)
+                {
+                    logger.trace("ECHO_REQ to {} failed ({})", addr, 
failureReason);
+                    inflightEcho.remove(addr);
+                }
             };
             MessagingService.instance().sendWithCallback(echoMessage, addr, 
echoHandler);
         } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to