Hi all, I've been using Reaper to run repairs, but it has hung. I tried to run:
nodetool repair -pr
on each of the nodes, but they all fail with some form of this error:

error: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
        at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
        at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
        at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
        at java.base/java.lang.Thread.run(Thread.java:829)

Using version 4.1.2-1
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns  Host ID                               Rack
UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Debug log has:


DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 MigrationCoordinator.java:264 - Pulling unreceived schema versions...
INFO  [HintsDispatcher:11344] 2023-08-04 11:56:21,369 HintsDispatchExecutor.java:318 - Finished hinted handoff of file 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
WARN [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES] 2023-08-04 11:56:21,916 OutboundConnection.java:491 - /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel] dropping message of type HINT_REQ due to error
org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
        at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
        at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
        at org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
        at org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
        at org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
        at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
        at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:77)
        at org.apache.cassandra.net.Message$Serializer.serializePost40(Message.java:844)
        at org.apache.cassandra.net.Message$Serializer.serialize(Message.java:702)
        at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
        at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:690)
        at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:958)
        at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out
INFO  [Messaging-EventLoop-3-16] 2023-08-04 11:56:21,918 OutboundConnection.java:1153 - /172.16.100.34:7000(/172.16.100.34:59198)->/172.16.20.16:7000-LARGE_MESSAGES-2fc2c5b9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
ERROR [Repair-Task:437] 2023-08-04 11:56:28,592 RepairRunnable.java:160 - Repair 30675c00-32df-11ee-a7d8-05183c68b0d0 failed:
java.lang.RuntimeException: Did not get replies from all endpoints.
        at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:721)
        at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:654)
        at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
        at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
        at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
INFO  [Repair-Task:437] 2023-08-04 11:56:28,594 RepairRunnable.java:223 - [repair #30675c00-32df-11ee-a7d8-05183c68b0d0]Repair command #522 finished with error

Any suggestions on what to do next?
Thanks!

-Joe

