Cameron Zemek created CASSANDRA-21428:
-----------------------------------------
Summary: Nodes can become stuck in DOWN if ECHO_REQ has timeout
Key: CASSANDRA-21428
URL: https://issues.apache.org/jira/browse/CASSANDRA-21428
Project: Apache Cassandra
Issue Type: Bug
Components: Cluster/Gossip
Reporter: Cameron Zemek
In Gossiper, echoHandler only implements onResponse. RequestCallback.onFailure
has a default no-op, so when the ECHO_REQ times out or the remote node returns
an error, inflightEcho.remove(addr) is never called. The stale entry persists.
Any subsequent markAlive(addr, localState) call — where localState is the same
in-place-mutated object already in inflightEcho — sees
localState.equals(prevState) = true (identity equality, same reference) and
skips indefinitely. In a temporary-partition scenario (node briefly
unreachable, echo times out, node recovers with the same generation), the node
can get stuck permanently dead: the failure detector sees it as alive and keeps
triggering markAlive, but every invocation is suppressed by the stale entry.
The stale entry is only cleared by removeEndpoint() (explicit removal) or
silentlyMarkDead() via markDead() (failure detector conviction) — neither of
which fires if the failure detector is reporting the node as healthy.
Fix: override onFailure in echoHandler to call inflightEcho.remove(addr).
{code:java}
diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java
b/src/java/org/apache/cassandra/gms/Gossiper.java
index 647441ffba..6d046cd8ba 100644
--- a/src/java/org/apache/cassandra/gms/Gossiper.java
+++ b/src/java/org/apache/cassandra/gms/Gossiper.java
@@ -65,6 +65,7 @@ import
org.apache.cassandra.config.CassandraRelevantProperties;
import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.db.SystemKeyspace;
import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.exceptions.RequestFailureReason;
import org.apache.cassandra.locator.InetAddressAndPort;
import org.apache.cassandra.net.Message;
import org.apache.cassandra.net.MessagingService;
@@ -1442,13 +1443,30 @@ public class Gossiper implements
IFailureDetectionEventListener, GossiperMBean,
{
Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
logger.trace("Sending ECHO_REQ to {}", addr);
- RequestCallback echoHandler = msg ->
+ RequestCallback echoHandler = new RequestCallback()
{
- runInGossipStageBlocking(() -> {
- EndpointState epState = inflightEcho.remove(addr);
- if (epState != null)
- realMarkAlive(addr, epState);
- });
+ @Override
+ public void onResponse(Message msg)
+ {
+ runInGossipStageBlocking(() -> {
+ EndpointState epState = inflightEcho.remove(addr);
+ if (epState != null)
+ realMarkAlive(addr, epState);
+ });
+ }
+
+ @Override
+ public boolean invokeOnFailure()
+ {
+ return true;
+ }
+
+ @Override
+ public void onFailure(InetAddressAndPort from,
RequestFailureReason failureReason)
+ {
+ logger.trace("ECHO_REQ to {} failed ({})", addr,
failureReason);
+ inflightEcho.remove(addr);
+ }
};
MessagingService.instance().sendWithCallback(echoMessage, addr,
echoHandler);
} {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]