[ 
https://issues.apache.org/jira/browse/CASSANDRA-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545588#comment-17545588
 ] 

David Capwell commented on CASSANDRA-17566:
-------------------------------------------

5m feels weird, but it is likely on the client side... see 
org.apache.cassandra.tools.NodeProbe#JMX_NOTIFICATION_POLL_INTERVAL_SECONDS

Server:
{code}
if (!prepareLatch.await(getRpcTimeout(MILLISECONDS), MILLISECONDS) || timeouts.get() > 0)
  failRepair(parentRepairSession, "Did not get replies from all endpoints.");
{code}
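
To make the server-side check easier to reason about, here is a minimal standalone sketch of the same pattern: wait on a latch for up to the RPC timeout, then also treat any recorded timeout as a failure. The class name PrepareLatchSketch, the method allReplied, and its parameters are hypothetical and exist only for illustration; this is not the actual repair code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names do not come from the Cassandra code base.
final class PrepareLatchSketch
{
    /**
     * Returns true only if the latch reached zero before the timeout elapsed
     * and no endpoint recorded a timeout. Anything else is treated as
     * "did not get replies from all endpoints".
     */
    static boolean allReplied(CountDownLatch prepareLatch, AtomicInteger timeouts, long rpcTimeoutMillis)
            throws InterruptedException
    {
        // await returns false when the timeout elapses before the count hits zero
        boolean completed = prepareLatch.await(rpcTimeoutMillis, TimeUnit.MILLISECONDS);
        return completed && timeouts.get() == 0;
    }

    public static void main(String[] args) throws InterruptedException
    {
        CountDownLatch latch = new CountDownLatch(2); // expect two replies
        AtomicInteger timeouts = new AtomicInteger();
        latch.countDown();                            // only one reply arrives
        System.out.println(allReplied(latch, timeouts, 50)); // prints "false"
    }
}
```

With a 10s timeout in place of the 50ms used here, a missing reply would surface as a failure roughly 10s after the repair starts.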

Our timeout should be based on the RPC timeout, i.e. "request_timeout", which 
defaults to 10s.  Here is the list of files which touch this timeout value:

{code}
 $ grep -r request_timeout test/distributed/ | awk -F: '{print $1}' | sort -u
test/distributed//org/apache/cassandra/distributed/test/CASAddTest.java
test/distributed//org/apache/cassandra/distributed/test/CASContentionTest.java
test/distributed//org/apache/cassandra/distributed/test/CASMultiDCTest.java
test/distributed//org/apache/cassandra/distributed/test/CASTest.java
test/distributed//org/apache/cassandra/distributed/test/CasCriticalSectionTest.java
test/distributed//org/apache/cassandra/distributed/test/LargeColumnTest.java
test/distributed//org/apache/cassandra/distributed/test/LegacyCASTest.java
test/distributed//org/apache/cassandra/distributed/test/MessageFiltersTest.java
test/distributed//org/apache/cassandra/distributed/test/PaxosRepairTest.java
test/distributed//org/apache/cassandra/distributed/test/PaxosRepairTest2.java
test/distributed//org/apache/cassandra/distributed/test/ReadRepairEmptyRangeTombstonesTest.java
test/distributed//org/apache/cassandra/distributed/test/ReadRepairQueryTester.java
test/distributed//org/apache/cassandra/distributed/test/ReadRepairTest.java
test/distributed//org/apache/cassandra/distributed/test/ring/ReadsDuringBootstrapTest.java
test/distributed//org/apache/cassandra/distributed/upgrade/MixedModeAvailabilityTestBase.java
test/distributed//org/apache/cassandra/distributed/upgrade/MixedModeConsistencyTestBase.java
test/distributed//org/apache/cassandra/distributed/upgrade/MixedModeMessageForwardTest.java
{code}

ForceRepairTest isn't in that list, and it extends TestBaseImpl, which also isn't 
in the list; the jvm-dtest code doesn't touch it either... I can't explain why 5m...


bq. which indicates it should have returned the failure at 10:41:32,733, but 
this doesn't happen for 5 minutes

I agree, it should fail around that time. Did the client-side assert fire at 
that time, or was it delayed by 5m?  I am mostly asking whether the cluster took 
5m to shut down or whether it took 5m for the client to notice.  
org.apache.cassandra.tools.NodeProbe#JMX_NOTIFICATION_POLL_INTERVAL_SECONDS 
defaults to 5m, which suggests a JMX notification was dropped (JMX is lossy), so 
the client noticing that the repair failed 5m later makes sense, as that matches 
our poll timeout logic; lowering 
-Dcassandra.nodetool.jmx_notification_poll_interval_seconds would speed up 
those checks.
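
As a rough illustration of why a dropped notification shows up as a delay equal to the poll interval, here is a minimal sketch of a "wait for a push, fall back to polling" loop. The class PollFallbackSketch, the method awaitResult, and the queue/supplier parameters are hypothetical stand-ins, not NodeProbe's actual implementation; only the system property name and the 5m default come from the comment above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical illustration only; this does not mirror the real NodeProbe/RepairRunner code.
final class PollFallbackSketch
{
    /** Reads the poll interval; 300s mirrors the 5m default mentioned above. */
    static long pollIntervalSeconds()
    {
        return Long.getLong("cassandra.nodetool.jmx_notification_poll_interval_seconds", 300L);
    }

    /**
     * Waits for a pushed notification. If none arrives within the poll
     * interval (for example because it was dropped), asks the server for
     * the status directly and returns it once it is known.
     */
    static String awaitResult(BlockingQueue<String> notifications,
                              Supplier<String> pollStatus,
                              long pollIntervalMillis) throws InterruptedException
    {
        while (true)
        {
            // Blocks for at most one poll interval waiting for a push
            String pushed = notifications.poll(pollIntervalMillis, TimeUnit.MILLISECONDS);
            if (pushed != null)
                return pushed;

            // No push arrived: fall back to an explicit status check
            String polled = pollStatus.get(); // null means "still running"
            if (polled != null)
                return polled;
        }
    }
}
```

In this sketch a lost failure notification is only observed after one full poll interval, which is consistent with a client reporting the failure about 5m late when the default is in effect.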

{code}
stderr:
error: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
        at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
        at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
{code}

If we lower the client poll interval, the test should fail faster, but it should 
still fail.  It looks like a connection issue broke prepare when it wasn't 
expected, and at the moment we do not have retry logic in repair (known issue), 
so failing is the expected behavior for now.

> Fix flaky test - 
> org.apache.cassandra.distributed.test.repair.ForceRepairTest.force
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17566
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17566
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.1-beta, 4.x
>
>
> Seen on jenkins here: 
> [https://ci-cassandra.apache.org/job/Cassandra-trunk/1083/testReport/org.apache.cassandra.distributed.test.repair/ForceRepairTest/force_2/]
>  
> and circle here:
> https://app.circleci.com/pipelines/github/driftx/cassandra/440/workflows/42f936c7-2ede-4fbf-957c-5fb4e461dd90/jobs/5160/tests#failed-test-1
> {noformat}
> junit.framework.AssertionFailedError: nodetool command [repair, 
> distributed_test_keyspace, --force, --full] was not successful
> stdout:
> [2022-04-20 15:11:01,402] Starting repair command #2 
> (1701a090-c0bc-11ec-9898-07c796ce6a49), repairing keyspace 
> distributed_test_keyspace with repair options (parallelism: parallel, primary 
> range: false, incremental: false, job threads: 1, ColumnFamilies: [], 
> dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 3, pull repair: 
> false, force repair: true, optimise streams: false, ignore unreplicated 
> keyspaces: false, repairPaxos: true, paxosOnly: false)
> [2022-04-20 15:11:11,406] Repair command #2 failed with error Did not get 
> replies from all endpoints.
> [2022-04-20 15:11:11,408] Repair command #2 finished with error
> stderr:
> error: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: Repair command #2 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
>       at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>       at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>       at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>       at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>       at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}


