Can you please try running nodetool describecluster from every node of the cluster?
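A rough sketch of how the describecluster output from each node could be compared, assuming the output contains a "Schema versions:" block with one line per version followed by a bracketed endpoint list. The function names and the sample text are made up for illustration; the sample is not output from this cluster.

```python
import re
from typing import Dict, List

# One schema-version line looks like "<uuid>: [ip, ip]" or "UNREACHABLE: [ip]".
_VERSION_LINE = re.compile(r"^([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]$")


def schema_versions(output: str) -> Dict[str, List[str]]:
    """Map each schema version (or UNREACHABLE) to the endpoints listed for it."""
    versions: Dict[str, List[str]] = {}
    in_section = False
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("Schema versions:"):
            in_section = True
            continue
        if not in_section or not text:
            continue
        match = _VERSION_LINE.match(text)
        if match is None:
            break  # first non-blank line that is not a version line ends the block
        endpoints = [e.strip() for e in match.group(2).split(",") if e.strip()]
        versions[match.group(1)] = endpoints
    return versions


def schema_agrees(versions: Dict[str, List[str]]) -> bool:
    """True only if exactly one schema version is reported and nothing is unreachable."""
    return len(versions) == 1 and "UNREACHABLE" not in versions


# Hypothetical sample text, for illustration only.
SAMPLE = """Cluster Information:
    Name: Example Cluster
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]
        e84b6a60-24cf-30ca-9b58-452d92911703: [10.0.0.3]
        UNREACHABLE: [10.0.0.4]
"""

if __name__ == "__main__":
    found = schema_versions(SAMPLE)
    for version, endpoints in found.items():
        print(version, endpoints)
    print("schema agrees:", schema_agrees(found))
```

The idea would be to save the output from each node to a file and feed it through schema_versions; any node that reports more than one version, or an UNREACHABLE entry, is the one to look at first.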
I once saw a case where nodetool status showed all nodes as UN but describecluster did not.

Thanks
Surbhi

On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Hi All - been using reaper to do repairs, but it has hung. I tried to run:
> nodetool repair -pr
> on each of the nodes, but they all fail with some form of this error:
>
> error: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
>     at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>     at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>     at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>     at java.base/java.lang.Thread.run(Thread.java:829)
>
> Using version 4.1.2-1
> nodetool status
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load        Tokens  Owns  Host ID                               Rack
> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>
> Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
>
> Debug log has:
>
> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 MigrationCoordinator.java:264 - Pulling unreceived schema versions...
> INFO [HintsDispatcher:11344] 2023-08-04 11:56:21,369 HintsDispatchExecutor.java:318 - Finished hinted handoff of file 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
> WARN [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES] 2023-08-04 11:56:21,916 OutboundConnection.java:491 - /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel] dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>     at org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>     at org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
>     at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
>     at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
>     at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:77)
>     at org.apache.cassandra.net.Message$Serializer.serializePost40(Message.java:844)
>     at org.apache.cassandra.net.Message$Serializer.serialize(Message.java:702)
>     at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
>     at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:690)
>     at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:958)
>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out
> INFO [Messaging-EventLoop-3-16] 2023-08-04 11:56:21,918 OutboundConnection.java:1153 - /172.16.100.34:7000(/172.16.100.34:59198)->/172.16.20.16:7000-LARGE_MESSAGES-2fc2c5b9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
> ERROR [Repair-Task:437] 2023-08-04 11:56:28,592 RepairRunnable.java:160 - Repair 30675c00-32df-11ee-a7d8-05183c68b0d0 failed:
> java.lang.RuntimeException: Did not get replies from all endpoints.
>     at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:721)
>     at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:654)
>     at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
>     at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
>     at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
>     at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>     at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> INFO [Repair-Task:437] 2023-08-04 11:56:28,594 RepairRunnable.java:223 - [repair #30675c00-32df-11ee-a7d8-05183c68b0d0]Repair command #522 finished with error
>
> What to do?
> Thanks!
>
> -Joe