[ 
https://issues.apache.org/jira/browse/CASSANDRA-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086362#comment-18086362
 ] 

Stefan Miklosovic commented on CASSANDRA-21428:
-----------------------------------------------

6.0 https://pre-ci.cassandra.apache.org/job/cassandra-6.0/48/#showFailuresLink

> Nodes can become stuck in DOWN if ECHO_REQ has timeout
> ------------------------------------------------------
>
>                 Key: CASSANDRA-21428
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21428
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 6.x, 7.x
>
>
> In Gossiper, echoHandler only implements onResponse. 
> RequestCallback.onFailure has a default no-op, so when the ECHO_REQ times out 
> or the remote node returns an error, inflightEcho.remove(addr) is never 
> called. The stale entry persists. Any subsequent markAlive(addr, localState) 
> call — where localState is the same in-place-mutated object already in 
> inflightEcho — sees localState.equals(prevState) = true (identity equality, 
> same reference) and skips indefinitely. In a temporary-partition scenario 
> (node briefly unreachable, echo times out, node recovers with the same 
> generation), the node can get stuck permanently dead: the failure detector 
> sees it as alive and keeps triggering markAlive, but every invocation is 
> suppressed by the stale entry. The stale entry is only cleared by 
> removeEndpoint() (explicit removal) or silentlyMarkDead() via markDead() 
> (failure detector conviction) — neither of which fires if the failure 
> detector is reporting the node as healthy.
> Fix: override onFailure in echoHandler to call inflightEcho.remove(addr).
> {code:java}
> diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java 
> b/src/java/org/apache/cassandra/gms/Gossiper.java
> index 647441ffba..6d046cd8ba 100644
> --- a/src/java/org/apache/cassandra/gms/Gossiper.java
> +++ b/src/java/org/apache/cassandra/gms/Gossiper.java
> @@ -65,6 +65,7 @@ import 
> org.apache.cassandra.config.CassandraRelevantProperties;
>  import org.apache.cassandra.config.DatabaseDescriptor;
>  import org.apache.cassandra.db.SystemKeyspace;
>  import org.apache.cassandra.dht.Token;
> +import org.apache.cassandra.exceptions.RequestFailureReason;
>  import org.apache.cassandra.locator.InetAddressAndPort;
>  import org.apache.cassandra.net.Message;
>  import org.apache.cassandra.net.MessagingService;
> @@ -1442,13 +1443,30 @@ public class Gossiper implements 
> IFailureDetectionEventListener, GossiperMBean,
>          {
>              Message<NoPayload> echoMessage = Message.out(ECHO_REQ, 
> noPayload);
>              logger.trace("Sending ECHO_REQ to {}", addr);
> -            RequestCallback echoHandler = msg ->
> +            RequestCallback echoHandler = new RequestCallback()
>              {
> -                runInGossipStageBlocking(() -> {
> -                    EndpointState epState = inflightEcho.remove(addr);
> -                    if (epState != null)
> -                        realMarkAlive(addr, epState);
> -                });
> +                @Override
> +                public void onResponse(Message msg)
> +                {
> +                    runInGossipStageBlocking(() -> {
> +                        EndpointState epState = inflightEcho.remove(addr);
> +                        if (epState != null)
> +                            realMarkAlive(addr, epState);
> +                    });
> +                }
> +
> +                @Override
> +                public boolean invokeOnFailure()
> +                {
> +                    return true;
> +                }
> +
> +                @Override
> +                public void onFailure(InetAddressAndPort from, 
> RequestFailureReason failureReason)
> +                {
> +                    logger.trace("ECHO_REQ to {} failed ({})", addr, 
> failureReason);
> +                    inflightEcho.remove(addr);
> +                }
>              };
>              MessagingService.instance().sendWithCallback(echoMessage, addr, 
> echoHandler);
>          } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to