David Capwell created CASSANDRA-16585:
-----------------------------------------

             Summary: Periodic failures in *RepairCoordinator*Test
                 Key: CASSANDRA-16585
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16585
             Project: Cassandra
          Issue Type: Bug
          Components: CI, Consistency/Repair, Test/dtest/java
            Reporter: David Capwell
            Assignee: David Capwell


*RepairCoordinator*Test fails intermittently with errors such as the following:

FullRepairCoordinatorNeighbourDownTest#validationParticipentCrashesAndComesBack[DATACENTER_AWARE/true]
 

{code}
nodetool command [repair, distributed_test_keyspace, 
validationparticipentcrashesandcomesback_full_datacenter_aware_true, 
--dc-parallel, --full] Error message 'Some repair failed' does not contain any 
of [/127.0.0.2:7012 died]
stdout:
[2021-04-07 22:45:24,887] Starting repair command #10 
(f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace 
distributed_test_keyspace with repair options (parallelism: dc_parallel, 
primary range: false, incremental: false, job threads: 1, ColumnFamilies: 
[validationparticipentcrashesandcomesback_full_datacenter_aware_true], 
dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: 
false, force repair: false, optimise streams: false, ignore unreplicated 
keyspaces: false)
[2021-04-07 22:45:32,864] Repair command #10 failed with error Repair session 
f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], 
(9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died
[2021-04-07 22:45:32,887] After waiting for poll interval of 1 seconds queried 
for parent session status and discovered repair failed.
[2021-04-07 22:45:32,887] Repair command #10 finished with error
[2021-04-07 22:45:32,887] Some repair failed
[2021-04-07 22:45:32,888] Repair command #10 finished with error

stderr:
error: Some repair failed
-- StackTrace --
java.io.IOException: Some repair failed
at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)


Notifications:
Notification{type=START, src=repair:10, message=Starting repair command #10 
(f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace 
distributed_test_keyspace with repair options (parallelism: dc_parallel, 
primary range: false, incremental: false, job threads: 1, ColumnFamilies: 
[validationparticipentcrashesandcomesback_full_datacenter_aware_true], 
dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: 
false, force repair: false, optimise streams: false, ignore unreplicated 
keyspaces: false)}
Notification{type=ERROR, src=repair:10, message=Repair command #10 failed with 
error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range 
[(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint 
/127.0.0.2:7012 died}
Notification{type=COMPLETE, src=repair:10, message=Repair command #10 finished 
with error}
Error:
java.io.IOException: Some repair failed
at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
{code}

There seems to be a race condition in nodetool repair: we query the error 
state before we receive the error notification, and so throw the generic error 
("Some repair failed") rather than the specific one ("Endpoint /127.0.0.2:7012 
died").
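
The race described above can be modelled in isolation. The sketch below is 
hypothetical and is not the RepairRunner code: the class RepairErrorRace, its 
methods, and the grace period are all names and choices made up for 
illustration. It shows one possible mitigation, where the polling path waits 
briefly for the specific error before falling back to the generic one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the race; not the actual nodetool implementation.
class RepairErrorRace {
    static final String GENERIC_ERROR = "Some repair failed";

    // Completed by the notification path once an ERROR notification arrives.
    private final CompletableFuture<String> specificError = new CompletableFuture<>();

    // Notification path: records the specific error message.
    public void onErrorNotification(String message) {
        specificError.complete(message);
    }

    // Polling path: called after a status query has reported the repair as failed.
    // Waits up to graceMillis for the specific error, then falls back to the generic one.
    public String failureMessage(long graceMillis) {
        try {
            return specificError.get(graceMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return GENERIC_ERROR;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return GENERIC_ERROR;
        }
    }

    public static void main(String[] args) {
        // Notification arrived before the poll: the specific error is reported.
        RepairErrorRace notified = new RepairErrorRace();
        notified.onErrorNotification("Endpoint /127.0.0.2:7012 died");
        System.out.println(notified.failureMessage(100));

        // Notification never arrived: only the generic error is available.
        RepairErrorRace silent = new RepairErrorRace();
        System.out.println(silent.failureMessage(100));
    }
}
```

Whether a bounded wait like this is the right fix for nodetool is an open 
question; the sketch only illustrates why the order of the poll and the 
notification decides which message the caller sees.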


