[ 
https://issues.apache.org/jira/browse/CASSANDRA-18440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893158#comment-17893158
 ] 

Kristijonas Zalys edited comment on CASSANDRA-18440 at 10/26/24 10:55 PM:
--------------------------------------------------------------------------

This is a race condition inside [shutdown in 
Instance.java|https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/impl/Instance.java#L888].
 In the test's logs you can see the following messages:
{code:java}
INFO  [node1_GossipStage:1] node1 2024-10-26 15:25:36,891 Gossiper.java:1558 - 
InetAddress /127.0.0.3:7012 is now DOWN
INFO  [node2_GossipStage:1] node2 2024-10-26 15:25:36,891 Gossiper.java:1558 - 
InetAddress /127.0.0.3:7012 is now DOWN 
...
DEBUG [node1_GossipStage:1] node1 2024-10-26 15:25:37,064 Gossiper.java:1541 - 
removing expire time for endpoint : /127.0.0.3:7012
INFO  [node1_GossipStage:1] node1 2024-10-26 15:25:37,064 Gossiper.java:1542 - 
InetAddress /127.0.0.3:7012 is now UP
...
INFO  [node1_Repair-Task:1] node1 2024-10-26 15:25:41,024 
RepairRunnable.java:325 - Starting repair command #1 ...{code}
Note that, after being shut down, node 3 is again marked as UP. This is 
because, while node 3 was shutting down, node 1 sent an 
[ECHO_REQ|https://github.com/apache/cassandra/blob/c8854af03064ae894e4d69253723a1f99f001450/src/java/org/apache/cassandra/gms/Gossiper.java#L1116]
 to node 3. While the handler for 
[ECHO_REQ|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/service/EchoVerbHandler.java#L39]
 does check whether the instance is marked as shut down, the isShutdown flag in 
StorageService is [updated only inside a JVM runtime shutdown 
hook|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/service/StorageService.java#L5479].
 This means that, at the time the ECHO_REQ is served, the shutdown hook has 
not yet been triggered and the isShutdown flag is still false. Hence 
node 3 returns a valid response to ECHO_REQ and is marked as UP by other nodes.
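
To make the race concrete, here is a minimal, self-contained sketch (toy classes standing in for EchoVerbHandler and StorageService, not Cassandra's actual code): an ECHO_REQ served before the shutdown hook has run still gets a positive reply.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class EchoRaceSketch {
    // Stand-in for StorageService's isShutdown flag, set only by the JVM shutdown hook.
    static final AtomicBoolean isShutdown = new AtomicBoolean(false);

    // Stand-in for the ECHO_REQ handler: reply positively unless shut down.
    static boolean handleEchoRequest() {
        return !isShutdown.get(); // true => the peer marks this node UP
    }

    // Stand-in for the JVM runtime shutdown hook.
    static void runShutdownHook() {
        isShutdown.set(true);
    }

    public static void main(String[] args) {
        // Node 3 has started shutting down, but the hook has not fired yet,
        // so an ECHO_REQ from node 1 still gets a positive reply...
        boolean replyBeforeHook = handleEchoRequest();
        runShutdownHook();
        // ...whereas after the hook it would be refused.
        boolean replyAfterHook = handleEchoRequest();
        System.out.println(replyBeforeHook + " " + replyAfterHook);
    }
}
{code}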

The [normal shutdown 
process|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/service/StorageService.java#L5469]
 first updates the isShutdown flag and only then announces that the node is 
shutting down. However, the dtests seem to skip a few of the steps of the 
regular shutdown process, including updating the isShutdown flag. This issue 
seems to have [already been fixed in 
trunk|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/EchoVerbHandler.java#L40]
 by modifying the ECHO_REQ handler to check if the Gossiper has announced 
shutdown. However, the problem is still present in 5.0 and below.
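
The trunk fix amounts to the handler also consulting the gossip-level shutdown announcement; a sketch in the same toy model (the gossipShutdownAnnounced flag and names are hypothetical, not the actual trunk API):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class EchoHandlerTrunkFixSketch {
    // The JVM-hook flag (unchanged) plus a gossip-level signal that is set
    // as soon as the node announces shutdown to its peers.
    static final AtomicBoolean isShutdown = new AtomicBoolean(false);
    static final AtomicBoolean gossipShutdownAnnounced = new AtomicBoolean(false);

    // Fixed handler: refuse the echo if either signal says we are going down.
    static boolean handleEchoRequest() {
        return !isShutdown.get() && !gossipShutdownAnnounced.get();
    }

    public static void main(String[] args) {
        // Gossip announces shutdown well before the JVM hook runs,
        // closing the window for a spurious UP reply.
        gossipShutdownAnnounced.set(true);
        System.out.println(handleEchoRequest());
    }
}
{code}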

I have created a PR which improves the dtest shutdown to update the isShutdown 
flag accordingly: [https://github.com/apache/cassandra/pull/3636]
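
The PR's approach, again in the toy model: make the dtest shutdown mirror the normal ordering by flipping isShutdown before announcing, so any ECHO_REQ handled in between already sees the flag (method names are illustrative only):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class DtestShutdownOrderSketch {
    static final AtomicBoolean isShutdown = new AtomicBoolean(false);

    static boolean handleEchoRequest() {
        return !isShutdown.get();
    }

    static void announceShutdownViaGossip() {
        // In Cassandra this would gossip a shutdown state to peers.
    }

    // The corrected ordering: flag first, announcement second.
    static void shutdown() {
        isShutdown.set(true);
        announceShutdownViaGossip();
    }

    public static void main(String[] args) {
        shutdown();
        // An ECHO_REQ arriving at any point after the flag flip is refused.
        System.out.println(handleEchoRequest());
    }
}
{code}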

In theory, we could use a similar approach to what we have in trunk and modify 
the ECHO_REQ handler instead of the dtest. However, by fixing the dtest rather 
than the code under test we don't need to worry about breaking anything within 
the stable versions. Also, my fix and the fix in trunk are not mutually 
exclusive; my fix should be safe to merge into trunk without breaking anything.



> Test failure: 
> org.apache.cassandra.distributed.test.RepairTest.testForcedNormalRepairWithOneNodeDown
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18440
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18440
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The test appears to be flaky as seen in 
> https://ci-cassandra.apache.org/job/Cassandra-trunk/1509/testReport/org.apache.cassandra.distributed.test/RepairTest/testForcedNormalRepairWithOneNodeDown__jdk1_8/
> and 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2329/workflows/f56b57d8-a283-4c76-968a-a2a553bf57b7/jobs/20786/tests#failed-test-0
> {code:java}
> junit.framework.AssertionFailedError: Exception found expected null, but 
> was:<java.lang.RuntimeException: Did not get replies from all endpoints.
>       at 
> org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:748)
>       at 
> org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:681)
>       at 
> org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
>       at 
> org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
>       at 
> org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
>       at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>       at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.lang.Thread.run(Thread.java:750)
> >
>       at 
> org.apache.cassandra.distributed.test.DistributedRepairUtils.lambda$assertParentRepairSuccess$4(DistributedRepairUtils.java:129)
>       at 
> org.apache.cassandra.distributed.test.DistributedRepairUtils.validateExistingParentRepair(DistributedRepairUtils.java:164)
>       at 
> org.apache.cassandra.distributed.test.DistributedRepairUtils.assertParentRepairSuccess(DistributedRepairUtils.java:124)
>       at 
> org.apache.cassandra.distributed.test.RepairTest.testForcedNormalRepairWithOneNodeDown(RepairTest.java:212)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
