[
https://issues.apache.org/jira/browse/CASSANDRA-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086349#comment-18086349
]
Stefan Miklosovic commented on CASSANDRA-21428:
-----------------------------------------------
[CASSANDRA-21428-4.0|https://github.com/instaclustr/cassandra/tree/CASSANDRA-21428-4.0]
{noformat}
java8_pre-commit_tests
  ✓ j8_build 3m 0s
  ✓ j8_cqlsh-dtests-py2-no-vnodes 6m 4s
  ✓ j8_cqlsh-dtests-py2-with-vnodes 6m 3s
  ✓ j8_cqlsh_dtests_py3 5m 51s
  ✓ j8_cqlsh_dtests_py38 6m 6s
  ✓ j8_cqlsh_dtests_py38_vnode 7m 2s
  ✓ j8_cqlsh_dtests_py3_vnode 6m 10s
  ✓ j8_cqlshlib_tests 7m 6s
  ✓ j8_dtests_vnode 43m 30s
  ✓ j8_jvm_dtests 14m 0s
  ✓ j11_unit_tests 9m 39s
  ✓ j11_dtests_vnode 43m 44s
  ✓ j11_cqlsh_dtests_py3_vnode 5m 53s
  ✓ j11_cqlsh_dtests_py38_vnode 6m 5s
  ✓ j11_cqlsh_dtests_py38 6m 5s
  ✓ j11_cqlsh_dtests_py311_vnode 6m 11s
  ✓ j11_cqlsh_dtests_py311 6m 5s
  ✓ j11_cqlsh_dtests_py3 6m 23s
  ✓ j11_cqlsh-dtests-py2-with-vnodes 5m 55s
  ✓ j11_cqlsh-dtests-py2-no-vnodes 6m 3s
  ✕ j8_cqlsh_dtests_py311 1m 13s
  ✕ j8_cqlsh_dtests_py311_vnode 1m 52s
  ✕ j8_dtests 56m 1s
      refresh_test.TestRefresh test_refresh_deadlock_startup
  ✕ j8_unit_tests 8m 38s
      org.apache.cassandra.index.sasi.SASICQLTest testPagingWithClustering
  ✕ j8_utests_system_keyspace_directory 9m 15s
      org.apache.cassandra.net.ConnectionTest testMessageDeliveryOnReconnect
  ✕ j11_dtests 56m 42s
      refresh_test.TestRefresh test_refresh_deadlock_startup
{noformat}
[java8_pre-commit_tests|https://app.circleci.com/pipelines/github/instaclustr/cassandra/6477/workflows/0de66a77-e8de-44d1-9627-8af8c20a327f]
> Nodes can become stuck in DOWN if ECHO_REQ has timeout
> ------------------------------------------------------
>
> Key: CASSANDRA-21428
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21428
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Cameron Zemek
> Assignee: Cameron Zemek
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 6.x, 7.x
>
>
> In Gossiper, echoHandler implements only onResponse.
> RequestCallback.onFailure defaults to a no-op, so when the ECHO_REQ times out
> or the remote node returns an error, inflightEcho.remove(addr) is never
> called and the stale entry persists. Any subsequent
> markAlive(addr, localState) call, where localState is the same
> in-place-mutated object already in inflightEcho, sees
> localState.equals(prevState) == true (identity equality, same reference) and
> skips the echo, indefinitely. In a temporary-partition scenario (node briefly
> unreachable, echo times out, node recovers with the same generation), the
> node can get stuck permanently dead: the failure detector sees it as alive
> and keeps triggering markAlive, but every invocation is suppressed by the
> stale entry. The stale entry is cleared only by removeEndpoint() (explicit
> removal) or by silentlyMarkDead() via markDead() (failure detector
> conviction), and neither fires while the failure detector reports the node
> as healthy.
> Fix: override onFailure in echoHandler to call inflightEcho.remove(addr), and
> override invokeOnFailure to return true, as the patch below does.
> {code:java}
> diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java b/src/java/org/apache/cassandra/gms/Gossiper.java
> index 647441ffba..6d046cd8ba 100644
> --- a/src/java/org/apache/cassandra/gms/Gossiper.java
> +++ b/src/java/org/apache/cassandra/gms/Gossiper.java
> @@ -65,6 +65,7 @@ import org.apache.cassandra.config.CassandraRelevantProperties;
>  import org.apache.cassandra.config.DatabaseDescriptor;
>  import org.apache.cassandra.db.SystemKeyspace;
>  import org.apache.cassandra.dht.Token;
> +import org.apache.cassandra.exceptions.RequestFailureReason;
>  import org.apache.cassandra.locator.InetAddressAndPort;
>  import org.apache.cassandra.net.Message;
>  import org.apache.cassandra.net.MessagingService;
> @@ -1442,13 +1443,30 @@ public class Gossiper implements IFailureDetectionEventListener, GossiperMBean,
>          {
>              Message<NoPayload> echoMessage = Message.out(ECHO_REQ, noPayload);
>              logger.trace("Sending ECHO_REQ to {}", addr);
> -            RequestCallback echoHandler = msg ->
> +            RequestCallback echoHandler = new RequestCallback()
>              {
> -                runInGossipStageBlocking(() -> {
> -                    EndpointState epState = inflightEcho.remove(addr);
> -                    if (epState != null)
> -                        realMarkAlive(addr, epState);
> -                });
> +                @Override
> +                public void onResponse(Message msg)
> +                {
> +                    runInGossipStageBlocking(() -> {
> +                        EndpointState epState = inflightEcho.remove(addr);
> +                        if (epState != null)
> +                            realMarkAlive(addr, epState);
> +                    });
> +                }
> +
> +                @Override
> +                public boolean invokeOnFailure()
> +                {
> +                    return true;
> +                }
> +
> +                @Override
> +                public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
> +                {
> +                    logger.trace("ECHO_REQ to {} failed ({})", addr, failureReason);
> +                    inflightEcho.remove(addr);
> +                }
>              };
>              MessagingService.instance().sendWithCallback(echoMessage, addr, echoHandler);
>          }
> {code}
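
The suppression described in the issue can be reproduced outside Cassandra with a small standalone model. This is only an illustrative sketch, not Gossiper code: the class InflightEchoSketch, the State and Callback types, and the simulate helper are all hypothetical, and it assumes markAlive obtains prevState from inflightEcho.put(addr, localState), which the description implies but does not show.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InflightEchoSketch
{
    /** Stand-in for EndpointState; relies on identity equals, as in the description. */
    static final class State {}

    /** Hypothetical callback shape whose onFailure defaults to a no-op. */
    interface Callback
    {
        void onResponse();
        default void onFailure() {}
    }

    private final Map<String, State> inflightEcho = new ConcurrentHashMap<>();
    private final boolean cleanUpOnFailure;
    int echoesSent = 0;

    InflightEchoSketch(boolean cleanUpOnFailure)
    {
        this.cleanUpOnFailure = cleanUpOnFailure;
    }

    /** Returns the callback of the echo that was sent, or null if the call was suppressed. */
    Callback markAlive(String addr, State localState)
    {
        State prevState = inflightEcho.put(addr, localState);
        if (localState.equals(prevState))
            return null; // an echo for this exact state is believed to be in flight

        echoesSent++;
        return new Callback()
        {
            @Override
            public void onResponse()
            {
                inflightEcho.remove(addr);
            }

            @Override
            public void onFailure()
            {
                // With cleanUpOnFailure == false this mirrors the no-op default.
                if (cleanUpOnFailure)
                    inflightEcho.remove(addr);
            }
        };
    }

    /** Sends one echo, lets it time out, then retries with the same state object. */
    static int simulate(boolean cleanUpOnFailure)
    {
        InflightEchoSketch gossiper = new InflightEchoSketch(cleanUpOnFailure);
        State state = new State();
        Callback first = gossiper.markAlive("node1", state);
        first.onFailure();                  // the echo times out
        gossiper.markAlive("node1", state); // the failure detector retries
        return gossiper.echoesSent;
    }

    public static void main(String[] args)
    {
        System.out.println("echoes sent without cleanup: " + simulate(false));
        System.out.println("echoes sent with cleanup:    " + simulate(true));
    }
}
```

In this model the retry is suppressed when the failure path leaves the entry behind, so only the first echo is ever sent; with cleanup in onFailure the retry goes out as a second echo.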