[
https://issues.apache.org/jira/browse/CASSANDRA-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528283#comment-17528283
]
Brandon Williams edited comment on CASSANDRA-17566 at 4/26/22 4:41 PM:
-----------------------------------------------------------------------
I haven't been able to reproduce this exact error, but it looks like it could
be a timeout, and timeouts I can reproduce with this test on a machine with
plenty of resources. I went back to when this test was committed, and this
behavior has been there since inception:
{noformat}
[junit-timeout] Testcase:
org.apache.cassandra.distributed.test.repair.ForceRepairTest:forceWithDifference:
Caused an ERROR
[junit-timeout] Timeout occurred. Please note the time in the report does not
reflect the time until the timeout.
[junit-timeout] junit.framework.AssertionFailedError: Timeout occurred. Please
note the time in the report does not reflect the time until the timeout.
[junit-timeout] at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout] at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout]
[junit-timeout]
[junit-timeout] Test
org.apache.cassandra.distributed.test.repair.ForceRepairTest FAILED (timeout)
{noformat}
I added some logging around the [args
loop|https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/repair/ForceRepairTest.java#L83],
which shows that sometimes there's a delay in the [nodetool
failure|https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/repair/ForceRepairTest.java#L95]:
{noformat}
INFO [main] <main> 2022-04-26 10:41:32,673 beginning --full
INFO [main] <main> 2022-04-26 10:41:32,686 failure finished
INFO [main] <main> 2022-04-26 10:41:32,723 success finished
INFO [main] <main> 2022-04-26 10:41:32,724 finished --full
INFO [main] <main> 2022-04-26 10:41:32,724 beginning --preview
INFO [main] <main> 2022-04-26 10:46:32,734 failure finished
INFO [main] <main> 2022-04-26 10:46:32,794 success finished
INFO [main] <main> 2022-04-26 10:46:32,796 finished --preview
INFO [main] <main> 2022-04-26 10:46:32,796 beginning --validate
INFO [main] <main> 2022-04-26 10:51:32,803 failure finished
INFO [main] <main> 2022-04-26 10:51:32,840 success finished
INFO [main] <main> 2022-04-26 10:51:32,848 finished --validate
{noformat}
which is interestingly always 5 minutes. Here is node1's relevant log for the
preview at 10:41:32,724:
{noformat}
DEBUG [node1_Repair#6:1] node1 2022-04-26 10:41:32,726 [preview repair
#5906ca00-c577-11ec-b392-2548bbd24495] session task executor shut down
gracefully
INFO [node1_Repair-Task:1] node1 2022-04-26 10:41:32,732 Starting repair
command #7 (590b0fc0-c577-11ec-b392-2548bbd24495), repairing keyspace
distributed_test_keyspace with repair options (parallelism: parallel, primary
range: false, incremental: true, job threads: 1, ColumnFamilies: [],
dataCenters: [], hosts: [], previewKind: UNREPAIRED, # of ranges: 3, pull
repair: false, force repair: false, optimise streams: false, ignore
unreplicated keyspaces: false, repairPaxos: false, paxosOnly: false)
ERROR [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 Repair
590b0fc0-c577-11ec-b392-2548bbd24495 failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.2:7012
at
org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:726)
at
org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:652)
at
org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:401)
at
org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:280)
at
org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:249)
at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 [preview repair
#590b0fc0-c577-11ec-b392-2548bbd24495]Repair command #7 finished with error
{noformat}
which indicates it should have returned the failure at 10:41:32,733, but this
doesn't happen for 5 minutes, and the only thing in node1's log is:
{noformat}
INFO [node1_Messaging-EventLoop-3-1] node1 2022-04-26 10:42:01,657
/127.0.0.1:7012->/127.0.0.2:7012-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused:
/127.0.0.2:7012
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
{noformat}
until the next test begins. This feels like perhaps something in the dtest
framework to me but I'm not sure. WDYT [~dcapwell] ?
was (Author: brandon.williams):
I haven't been able to reproduce this exact error, but it looks like it could
be a timeout, and timeouts I can reproduce with this test on a machine with
plenty of resources. I went back to when this test was committed, and this
behavior has been there since inception:
{noformat}
[junit-timeout] Testcase:
org.apache.cassandra.distributed.test.repair.ForceRepairTest:forceWithDifference:
Caused an ERROR
[junit-timeout] Timeout occurred. Please note the time in the report does not
reflect the time until the timeout.
[junit-timeout] junit.framework.AssertionFailedError: Timeout occurred. Please
note the time in the report does not reflect the time until the timeout.
[junit-timeout] at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout] at java.util.Vector.forEach(Vector.java:1277)
[junit-timeout]
[junit-timeout]
[junit-timeout] Test
org.apache.cassandra.distributed.test.repair.ForceRepairTest FAILED (timeout)
{noformat}
I added some logging around the [args loop], which shows that sometimes there's
a delay in the ]nodetool failure]:
{noformat}
INFO [main] <main> 2022-04-26 10:41:32,673 beginning --full
INFO [main] <main> 2022-04-26 10:41:32,686 failure finished
INFO [main] <main> 2022-04-26 10:41:32,723 success finished
INFO [main] <main> 2022-04-26 10:41:32,724 finished --full
INFO [main] <main> 2022-04-26 10:41:32,724 beginning --preview
INFO [main] <main> 2022-04-26 10:46:32,734 failure finished
INFO [main] <main> 2022-04-26 10:46:32,794 success finished
INFO [main] <main> 2022-04-26 10:46:32,796 finished --preview
INFO [main] <main> 2022-04-26 10:46:32,796 beginning --validate
INFO [main] <main> 2022-04-26 10:51:32,803 failure finished
INFO [main] <main> 2022-04-26 10:51:32,840 success finished
INFO [main] <main> 2022-04-26 10:51:32,848 finished --validate
{noformat}
which is interestingly always 5 minutes. Here is node1's relevant log for the
preview at 10:41:32,724:
{noformat}
DEBUG [node1_Repair#6:1] node1 2022-04-26 10:41:32,726 [preview repair
#5906ca00-c577-11ec-b392-2548bbd24495] session task executor shut down
gracefully
INFO [node1_Repair-Task:1] node1 2022-04-26 10:41:32,732 Starting repair
command #7 (590b0fc0-c577-11ec-b392-2548bbd24495), repairing keyspace
distributed_test_keyspace with repair options (parallelism: parallel, primary
range: false, incremental: true, job threads: 1, ColumnFamilies: [],
dataCenters: [], hosts: [], previewKind: UNREPAIRED, # of ranges: 3, pull
repair: false, force repair: false, optimise streams: false, ignore
unreplicated keyspaces: false, repairPaxos: false, paxosOnly: false)
ERROR [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 Repair
590b0fc0-c577-11ec-b392-2548bbd24495 failed:
java.lang.RuntimeException: Endpoint not alive: /127.0.0.2:7012
at
org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:726)
at
org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:652)
at
org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:401)
at
org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:280)
at
org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:249)
at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [node1_Repair-Task:1] node1 2022-04-26 10:41:32,733 [preview repair
#590b0fc0-c577-11ec-b392-2548bbd24495]Repair command #7 finished with error
{noformat}
which indicates it should have returned the failure at 10:41:32,733, but this
doesn't happen for 5 minutes, and the only thing in node1's log is:
{noformat}
INFO [node1_Messaging-EventLoop-3-1] node1 2022-04-26 10:42:01,657
/127.0.0.1:7012->/127.0.0.2:7012-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused:
/127.0.0.2:7012
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
{noformat}
until the next test begins. This feels like perhaps something in the dtest
framework to me but I'm not sure. WDYT [~dcapwell] ?
> Fix flaky test -
> org.apache.cassandra.distributed.test.repair.ForceRepairTest.force
> -----------------------------------------------------------------------------------
>
> Key: CASSANDRA-17566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17566
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Brandon Williams
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 4.x
>
>
> Seen on jenkins here:
> [https://ci-cassandra.apache.org/job/Cassandra-trunk/1083/testReport/org.apache.cassandra.distributed.test.repair/ForceRepairTest/force_2/]
>
> and circle here:
> https://app.circleci.com/pipelines/github/driftx/cassandra/440/workflows/42f936c7-2ede-4fbf-957c-5fb4e461dd90/jobs/5160/tests#failed-test-1
> {noformat}
> junit.framework.AssertionFailedError: nodetool command [repair,
> distributed_test_keyspace, --force, --full] was not successful
> stdout:
> [2022-04-20 15:11:01,402] Starting repair command #2
> (1701a090-c0bc-11ec-9898-07c796ce6a49), repairing keyspace
> distributed_test_keyspace with repair options (parallelism: parallel, primary
> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
> dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 3, pull repair:
> false, force repair: true, optimise streams: false, ignore unreplicated
> keyspaces: false, repairPaxos: true, paxosOnly: false)
> [2022-04-20 15:11:11,406] Repair command #2 failed with error Did not get
> replies from all endpoints.
> [2022-04-20 15:11:11,408] Repair command #2 finished with error
> stderr:
> error: Repair job has failed with the error message: Repair command #2 failed
> with error Did not get replies from all endpoints.. Check the logs on the
> repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message:
> Repair command #2 failed with error Did not get replies from all endpoints..
> Check the logs on the repair participants for further details
> at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> at
> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
> at
> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
> at
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]