[jira] [Comment Edited] (CASSANDRA-17566) Fix flaky test - org.apache.cassandra.distributed.test.repair.ForceRepairTest.force

Brandon Williams (Jira) Tue, 26 Apr 2022 09:42:07 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528283#comment-17528283
 ]


Brandon Williams edited comment on CASSANDRA-17566 at 4/26/22 4:41 PM:
-----------------------------------------------------------------------

I haven't been able to reproduce this exact error, but it looks like it could 
be a timeout, and timeouts I can reproduce with this test on a machine with 
plenty of resources.  I went back to when this test was committed, and this 
behavior has been there since inception:

{noformat}
[junit-timeout] Testcase: 
org.apache.cassandra.distributed.test.repair.ForceRepairTest:forceWithDifference:
     Caused an ERROR
[junit-timeout] Timeout occurred. Please note the time in the report does not 
reflect the time until the timeout.
[junit-timeout] junit.framework.AssertionFailedError: Timeout occurred. Please 
note the time in the report does not reflect the time until the timeout.
[junit-timeout]         at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout]         at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout] 
[junit-timeout] 
[junit-timeout] Test 
org.apache.cassandra.distributed.test.repair.ForceRepairTest FAILED (timeout)
{noformat}

I added some logging around the [args 
loop|https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/repair/ForceRepairTest.java#L83],
 which shows that sometimes there's a delay in the [nodetool 
failure|https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/repair/ForceRepairTest.java#L95]:

{noformat}
INFO  [main] <main> 2022-04-26 10:41:32,673 beginning --full
INFO  [main] <main> 2022-04-26 10:41:32,686 failure finished
INFO  [main] <main> 2022-04-26 10:41:32,723 success finished
INFO  [main] <main> 2022-04-26 10:41:32,724 finished --full
INFO  [main] <main> 2022-04-26 10:41:32,724 beginning --preview
INFO  [main] <main> 2022-04-26 10:46:32,734 failure finished
INFO  [main] <main> 2022-04-26 10:46:32,794 success finished
INFO  [main] <main> 2022-04-26 10:46:32,796 finished --preview
INFO  [main] <main> 2022-04-26 10:46:32,796 beginning --validate
INFO  [main] <main> 2022-04-26 10:51:32,803 failure finished
INFO  [main] <main> 2022-04-26 10:51:32,840 success finished
INFO  [main] <main> 2022-04-26 10:51:32,848 finished --validate
{noformat}
 
 which is interestingly always 5 minutes.  Here is node1's relevant log for the 
preview at 10:41:32,724:

{noformat}
DEBUG [node1_Repair#6:1] node1 2022-04-26 10:41:32,726 [preview repair 
#5906ca00-c577-11ec-b392-2548bbd24495] session task executor shut down 
gracefully
INFO  [node1_Repair-Task:1] node1 2022-04-26 10:41:32,732 Starting repair 
command #7 (590b0fc0-c577-11ec-b392-2548bbd24495), repairing keyspace 
distributed_test_keyspace with repair options (parallelism: parallel, primary 
range: false, incremental: true, job threads: 1, ColumnFamilies: [], 
dataCenters: [], hosts: [], previewKind: UNREPAIRED, # of ranges: 3, pull 
repair: false, force repair: false, optimise streams: false, ignore 
unreplicated keyspaces: false, repairPaxos: false, paxosOnly: false)
ERROR [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 Repair 
590b0fc0-c577-11ec-b392-2548bbd24495 failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.2:7012
        at 
org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:726)
        at 
org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:652)
        at 
org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:401)
        at 
org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:280)
        at 
org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:249)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
INFO  [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 [preview repair 
#590b0fc0-c577-11ec-b392-2548bbd24495]Repair command #7 finished with error
{noformat}

which indicates it should have returned the failure at 10:41:32,733, but this 
doesn't happen for 5 minutes, and the only thing in node1's log is:

{noformat}
INFO  [node1_Messaging-EventLoop-3-1] node1 2022-04-26 10:42:01,657 
/127.0.0.1:7012->/127.0.0.2:7012-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/127.0.0.2:7012
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

until the next test begins. This feels like perhaps something in the dtest 
framework to me but I'm not sure.  WDYT [~dcapwell] ?


was (Author: brandon.williams):
I haven't been able to reproduce this exact error, but it looks like it could 
be a timeout, and timeouts I can reproduce with this test on a machine with 
plenty of resources.  I went back to when this test was committed, and this 
behavior has been there since inception:

{noformat}
[junit-timeout] Testcase: 
org.apache.cassandra.distributed.test.repair.ForceRepairTest:forceWithDifference:
     Caused an ERROR
[junit-timeout] Timeout occurred. Please note the time in the report does not 
reflect the time until the timeout.
[junit-timeout] junit.framework.AssertionFailedError: Timeout occurred. Please 
note the time in the report does not reflect the time until the timeout.
[junit-timeout]         at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout]         at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout] 
[junit-timeout] 
[junit-timeout] Test 
org.apache.cassandra.distributed.test.repair.ForceRepairTest FAILED (timeout)
{noformat}

I added some logging around the [args loop], which shows that sometimes there's 
a delay in the ]nodetool failure]:

{noformat}
INFO  [main] <main> 2022-04-26 10:41:32,673 beginning --full
INFO  [main] <main> 2022-04-26 10:41:32,686 failure finished
INFO  [main] <main> 2022-04-26 10:41:32,723 success finished
INFO  [main] <main> 2022-04-26 10:41:32,724 finished --full
INFO  [main] <main> 2022-04-26 10:41:32,724 beginning --preview
INFO  [main] <main> 2022-04-26 10:46:32,734 failure finished
INFO  [main] <main> 2022-04-26 10:46:32,794 success finished
INFO  [main] <main> 2022-04-26 10:46:32,796 finished --preview
INFO  [main] <main> 2022-04-26 10:46:32,796 beginning --validate
INFO  [main] <main> 2022-04-26 10:51:32,803 failure finished
INFO  [main] <main> 2022-04-26 10:51:32,840 success finished
INFO  [main] <main> 2022-04-26 10:51:32,848 finished --validate
{noformat}
 
 which is interestingly always 5 minutes.  Here is node1's relevant log for the 
preview at 10:41:32,724:

{noformat}
DEBUG [node1_Repair#6:1] node1 2022-04-26 10:41:32,726 [preview repair 
#5906ca00-c577-11ec-b392-2548bbd24495] session task executor shut down 
gracefully
INFO  [node1_Repair-Task:1] node1 2022-04-26 10:41:32,732 Starting repair 
command #7 (590b0fc0-c577-11ec-b392-2548bbd24495), repairing keyspace 
distributed_test_keyspace with repair options (parallelism: parallel, primary 
range: false, incremental: true, job threads: 1, ColumnFamilies: [], 
dataCenters: [], hosts: [], previewKind: UNREPAIRED, # of ranges: 3, pull 
repair: false, force repair: false, optimise streams: false, ignore 
unreplicated keyspaces: false, repairPaxos: false, paxosOnly: false)
ERROR [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 Repair 
590b0fc0-c577-11ec-b392-2548bbd24495 failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.2:7012
        at 
org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:726)
        at 
org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:652)
        at 
org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:401)
        at 
org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:280)
        at 
org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:249)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
INFO  [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 [preview repair 
#590b0fc0-c577-11ec-b392-2548bbd24495]Repair command #7 finished with error
{noformat}

which indicates it should have returned the failure at 10:41:32,733, but this 
doesn't happen for 5 minutes, and the only thing in node1's log is:

{noformat}
INFO  [node1_Messaging-EventLoop-3-1] node1 2022-04-26 10:42:01,657 
/127.0.0.1:7012->/127.0.0.2:7012-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/127.0.0.2:7012
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

until the next test begins. This feels like perhaps something in the dtest 
framework to me but I'm not sure.  WDYT [~dcapwell] ?

> Fix flaky test - 
> org.apache.cassandra.distributed.test.repair.ForceRepairTest.force
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17566
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.x
>
>
> Seen on jenkins here: 
> [https://ci-cassandra.apache.org/job/Cassandra-trunk/1083/testReport/org.apache.cassandra.distributed.test.repair/ForceRepairTest/force_2/]
>  
> and circle here:
> https://app.circleci.com/pipelines/github/driftx/cassandra/440/workflows/42f936c7-2ede-4fbf-957c-5fb4e461dd90/jobs/5160/tests#failed-test-1
> {noformat}
> junit.framework.AssertionFailedError: nodetool command [repair, 
> distributed_test_keyspace, --force, --full] was not successful
> stdout:
> [2022-04-20 15:11:01,402] Starting repair command #2 
> (1701a090-c0bc-11ec-9898-07c796ce6a49), repairing keyspace 
> distributed_test_keyspace with repair options (parallelism: parallel, primary 
> range: false, incremental: false, job threads: 1, ColumnFamilies: [], 
> dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 3, pull repair: 
> false, force repair: true, optimise streams: false, ignore unreplicated 
> keyspaces: false, repairPaxos: true, paxosOnly: false)
> [2022-04-20 15:11:11,406] Repair command #2 failed with error Did not get 
> replies from all endpoints.
> [2022-04-20 15:11:11,408] Repair command #2 finished with error
> stderr:
> error: Repair job has failed with the error message: Repair command #2 failed 
> with error Did not get replies from all endpoints.. Check the logs on the 
> repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: 
> Repair command #2 failed with error Did not get replies from all endpoints.. 
> Check the logs on the repair participants for further details
>       at 
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>       at 
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>       at 
> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>       at 
> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>       at 
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (CASSANDRA-17566) Fix flaky test - org.apache.cassandra.distributed.test.repair.ForceRepairTest.force

Reply via email to