Try sstablescrub on the node where it is showing corrupted data.

On Fri, Aug 11, 2023 at 8:38 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
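For reference, `sstablescrub` is the offline scrub tool: it runs against a *stopped* node and takes the keyspace and table name. A minimal sketch, assuming the keyspace/table named in the logs below (`doc` / `source_correlations`); the `grep`/`tr` pipeline is just one way to pull the offending file path out of the exception line, and the exact scrub flags should be confirmed with `sstablescrub -h` on your version:

```shell
# Sample log line, copied from the CorruptBlockException later in this thread.
line='Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-9816-big-Data.db): corruption detected, chunk at 604552 of length 7911.'

# Extract the corrupted SSTable path from the exception message.
echo "$line" | grep -o '(/[^)]*-Data\.db)' | tr -d '()'

# With Cassandra stopped on the affected node, scrub the table offline
# (sketch only, not executed here; -s skips rows that cannot be recovered):
#   sstablescrub -s doc source_correlations
# Then restart the node and re-run the repair.
```

After the scrub, a full repair of that table should succeed past the previously corrupt chunk; if the offline scrub discards rows, they get re-replicated from the other replicas by the repair.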
> Finally found a message on another node that seems relevant:
>
> INFO [CompactionExecutor:7413] 2023-08-11 11:36:22,397 CompactionTask.java:164 - Compacting (d30b64ba-385c-11ee-8e74-edf5512ad115)
> [/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-97958-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-91664-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90239-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-99385-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-101078-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-86112-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90753-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-53333-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-94008-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-92338-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-87273-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-82398-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-94244-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-80384-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-65431-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90412-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90104-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-85155-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-92914-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-78344-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-53269-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-99242-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-73898-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-100473-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-76035-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-101352-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-62093-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-93643-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-97812-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-73062-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-65491-big-Data.db:level=0,
> /data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-93299-big-Data.db:level=0,
> ]
> DEBUG [CompactionExecutor:7412] 2023-08-11 11:36:22,398 Directories.java:502 - DataDirectory /data/7/cassandra/data has 91947520000 bytes available, checking if we can write 10716461 bytes
> INFO [CompactionExecutor:7412] 2023-08-11 11:36:22,398 CompactionTask.java:164 - Compacting (d30b64b0-385c-11ee-8e74-edf5512ad115)
> [/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-36867-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32270-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32287-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-30785-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32545-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38791-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38586-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-36849-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-39083-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-9816-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-16383-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-17443-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-30587-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38815-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32235-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38817-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-19013-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32326-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32827-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-39106-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-42758-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32428-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-39653-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-16889-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-18940-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-41236-big-Data.db:level=0,
> /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-36654-big-Data.db:level=0,
> ]
> INFO [CompactionExecutor:7412] 2023-08-11 11:36:22,398 NoSpamLogger.java:105 - Maximum memory usage reached (512.000MiB) for chunk-cache buffer pool, cannot allocate chunk of 8.000MiB
> ERROR [CompactionExecutor:7412] 2023-08-11 11:36:23,109 JVMStabilityInspector.java:68 - Exception in thread Thread[CompactionExecutor:7412,5,CompactionExecutor]
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-9816-big-Data.db
>     at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:229)
>     at org.apache.cassandra.io.util.BufferManagingRebufferer.rebuffer(BufferManagingRebufferer.java:79)
>     at org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:67)
>     at org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:61)
>     at org.apache.cassandra.io.util.RebufferingInputStream.read(RebufferingInputStream.java:90)
>     at org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:68)
>     at org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:62)
>     at org.apache.cassandra.db.marshal.ByteArrayAccessor.read(ByteArrayAccessor.java:103)
>     at org.apache.cassandra.db.marshal.ByteArrayAccessor.read(ByteArrayAccessor.java:40)
>     at org.apache.cassandra.db.marshal.AbstractType.read(AbstractType.java:530)
>     at org.apache.cassandra.db.marshal.AbstractType.readArray(AbstractType.java:510)
>     at org.apache.cassandra.db.ClusteringPrefix$Serializer.deserializeValuesWithoutSize(ClusteringPrefix.java:441)
>     at org.apache.cassandra.db.Clustering$Serializer.deserialize(Clustering.java:165)
>     at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:478)
>     at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:435)
>     at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:84)
>     at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:62)
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>     at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:126)
>     at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
>     at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>     at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:376)
>     at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:188)
>     at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:157)
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>     at org.apache.cassandra.db.rows.UnfilteredRowIterators$UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:523)
>     at org.apache.cassandra.db.rows.UnfilteredRowIterators$UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:391)
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>     at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
>     at org.apache.cassandra.db.transform.UnfilteredRows.isEmpty(UnfilteredRows.java:74)
>     at org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:75)
>     at org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:26)
>     at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:97)
>     at org.apache.cassandra.db.compaction.CompactionIterator.hasNext(CompactionIterator.java:275)
>     at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:203)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:82)
>     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:100)
>     at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:359)
>     at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:113)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-9816-big-Data.db): corruption detected, chunk at 604552 of length 7911.
>     at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:221)
>     ... 46 common frames omitted
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-9816-big-Data.db): corruption detected, chunk at 604552 of length 7911.
>     at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:209)
>
> Ideas?
>
> -Joe
>
> On 8/7/2023 10:27 PM, manish khandelwal wrote:
>
> What do the logs of /172.16.20.16:7000 say when the repair failed? It indicates "validation failed". Can you check system.log for /172.16.20.16:7000 and see what it says? Looks like you have some issue with *doc/origdoc*, probably a corrupt sstable. Try to run repair for each individual table and see for which table the repair fails.
>
> Regards
> Manish
>
> On Mon, Aug 7, 2023 at 11:39 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>
>> Thank you. I've tried:
>> nodetool repair --full
>> nodetool repair -pr
>> They all get to 57% on any of the nodes, and then fail. Interestingly, the debug log only has INFO - there are no errors.
>> >> [2023-08-07 14:02:09,828] Repair command #6 failed with error Incremental >> repair session 83dc17d0-354c-11ee-809c-177460b0ed52 has failed >> [2023-08-07 14:02:09,830] Repair command #6 finished with error >> error: Repair job has failed with the error message: Repair command #6 >> failed with error Incremental repair session >> 83dc17d0-354c-11ee-809c-177460b0ed52 has failed. Check the logs on the >> repair participants for further details >> -- StackTrace -- >> java.lang.RuntimeException: Repair job has failed with the error message: >> Repair command #6 failed with error Incremental repair session >> 83dc17d0-354c-11ee-809c-177460b0ed52 has failed. Check the logs on the >> repair participants for further details >> at >> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137) >> at >> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108) >> at java.base/java.lang.Thread.run(Thread.java:829) >> >> Full repair results on another node: >> >> >> [2023-08-04 20:21:42,575] Repair session >> 14830280-3304-11ee-939d-635768ac938c for range >> [(-5756366402057257951,-5754159509763216479], >> (-2469484655657848961,-2461953651636879320], >> (-5175468354897450191,-5171107677178073434], >> (-628587988891618162,-624346074440106568], >> (-6615381309032691143,-6603240846496048854], >> (6616005974054228159,6628798414170514490], >> 
(8013321283688199900,8017115978405113835], >> (-7829682363035100161,-7824999966028871477], >> (2848484090138352114,2852114415040125826], >> (-2477015659678818602,-2469484655657848961], >> (-2483470805982506865,-2477015659678818602]] finished (progress: 57%) >> [2023-08-04 20:36:23,786] Repair session >> 14cbcb50-3304-11ee-939d-635768ac938c for range >> [(5193761311910499374,5197212898580538329], >> (-1679246469353274066,-1672836360726470435], >> (-6927245454058012407,-6922951496140109663], >> (1851771008808005661,1854683726231521039], >> (5197212898580538329,5200664485250577285], >> (1848858291384490283,1851771008808005661], >> (-4736378492502250338,-4732073287189625685], >> (-2705389975640427939,-2699099608948332293], >> (-7806270378003956741,-7796905583991499373], >> (466064862768270626,473304202405656261], >> (250549667892224144,253421473349298265], >> (-6922951496140109663,-6920804517181158291], >> (249113765163687083,250549667892224144], >> (1854683726231521039,1857596443655036418], >> (4687110928509362134,4694325991399541085], >> (-6920804517181158291,-6918657538222206919], >> (4399045818626652943,4402968741621424236], >> (473304202405656261,480543542043041896]] finished (progress: 57%) >> [2023-08-04 20:36:23,795] Repair command #12 finished with error >> error: Repair job has failed with the error message: Repair command #12 >> failed with error Repair session 154f5330-3304-11ee-939d-635768ac938c for >> range [(5333449259855342357,5338449508113440752], >> (4959134492108085445,4965331080956982133], >> (5938148666505886222,5945280202710590417], >> (8428867157147807368,8431880058869458408], >> (5338449508113440752,5343449756371539147]] failed with error [repair >> #154f5330-3304-11ee-939d-635768ac938c on doc/origdoc, >> [(5333449259855342357,5338449508113440752], >> (4959134492108085445,4965331080956982133], >> (5938148666505886222,5945280202710590417], >> (8428867157147807368,8431880058869458408], >> (5338449508113440752,5343449756371539147]]] Validation 
failed in / >> 172.16.20.16:7000. Check the logs on the repair participants for further >> details >> -- StackTrace -- >> java.lang.RuntimeException: Repair job has failed with the error message: >> Repair command #12 failed with error Repair session >> 154f5330-3304-11ee-939d-635768ac938c for range >> [(5333449259855342357,5338449508113440752], >> (4959134492108085445,4965331080956982133], >> (5938148666505886222,5945280202710590417], >> (8428867157147807368,8431880058869458408], >> (5338449508113440752,5343449756371539147]] failed with error [repair >> #154f5330-3304-11ee-939d-635768ac938c on doc/origdoc, >> [(5333449259855342357,5338449508113440752], >> (4959134492108085445,4965331080956982133], >> (5938148666505886222,5945280202710590417], >> (8428867157147807368,8431880058869458408], >> (5338449508113440752,5343449756371539147]]] Validation failed in / >> 172.16.20.16:7000. Check the logs on the repair participants for further >> details >> at >> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137) >> at >> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474) >> at >> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108) >> at java.base/java.lang.Thread.run(Thread.java:829) >> >> I'm not sure what to do next? >> >> -Joe >> On 8/6/2023 8:58 AM, Josh McKenzie wrote: >> >> Quick drive-by observation: >> >> Did not get replies from all endpoints.. 
>> Check the logs on the repair participants for further details
>>
>> dropping message of type HINT_REQ due to error org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
>>
>> Caused by: io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out
>>
>> java.lang.RuntimeException: Did not get replies from all endpoints.
>>
>> These all point to the same-shaped problem: for whatever reason, the coordinator of this repair didn't receive replies from the replicas executing it. Could be that they're dead, could be they took too long, could be they never got the start message, etc. Distributed operations are tricky like that.
>>
>> Logs on the replicas doing the actual repairs should give you more insight; this is a pretty low-level, generic set of errors that basically amounts to "we didn't hear back from the other participants in time, so we timed out."
>>
>> On Fri, Aug 4, 2023, at 12:02 PM, Surbhi Gupta wrote:
>>
>> Can you please try to run nodetool describecluster from every node of the cluster?
>>
>> One time I noticed an issue where nodetool status showed all nodes UN but describecluster did not agree.
>>
>> Thanks
>> Surbhi
>>
>> On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>> Hi All - been using reaper to do repairs, but it has hung. I tried to run:
>> nodetool repair -pr
>> on each of the nodes, but they all fail with some form of this error:
>>
>> error: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints.. Check the logs on the repair participants for further details
>> -- StackTrace --
>> java.lang.RuntimeException: Repair job has failed with the error message: Repair command #521 failed with error Did not get replies from all endpoints..
Check the logs on the repair participants for further details
>> at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>> at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>> at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>> at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>> at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>> at java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>> at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> Using version 4.1.2-1
>> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns  Host ID                               Rack
>> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
>> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
>> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
>> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
>> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
>> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
>> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
>> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
>> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
>> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
>> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
>> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
>> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
>> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>>
>> Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
>>
>> Debug log has:
>>
>> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 MigrationCoordinator.java:264 - Pulling unreceived schema versions...
>> INFO [HintsDispatcher:11344] 2023-08-04 11:56:21,369 HintsDispatchExecutor.java:318 - Finished hinted handoff of file 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
>> WARN [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES] 2023-08-04 11:56:21,916 OutboundConnection.java:491 - /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel] dropping message of type HINT_REQ due to error
>> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel this output stream was writing to has been closed
>> at org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>> at org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>> at org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>> at org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>> at org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
>> at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
>> at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
>> at org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:77)
>> at org.apache.cassandra.net.Message$Serializer.serializePost40(Message.java:844)
>> at org.apache.cassandra.net.Message$Serializer.serialize(Message.java:702)
>> at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.doRun(OutboundConnection.java:984)
>> at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:690)
>> at org.apache.cassandra.net.OutboundConnection$LargeMessageDelivery.run(OutboundConnection.java:958)
>> at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
>> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> at java.base/java.lang.Thread.run(Thread.java:829)
>> Caused by: io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out
>> INFO [Messaging-EventLoop-3-16] 2023-08-04 11:56:21,918 OutboundConnection.java:1153 - /172.16.100.34:7000(/172.16.100.34:59198)->/172.16.20.16:7000-LARGE_MESSAGES-2fc2c5b9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
>> ERROR [Repair-Task:437] 2023-08-04 11:56:28,592 RepairRunnable.java:160 - Repair 30675c00-32df-11ee-a7d8-05183c68b0d0 failed:
>> java.lang.RuntimeException: Did not get replies from all endpoints.
>> at org.apache.cassandra.service.ActiveRepairService.failRepair(ActiveRepairService.java:721)
>> at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:654)
>> at org.apache.cassandra.repair.RepairRunnable.prepare(RepairRunnable.java:400)
>> at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:279)
>> at org.apache.cassandra.repair.RepairRunnable.run(RepairRunnable.java:248)
>> at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>> at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>> at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>> at org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:81)
>> at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:47)
>> at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:57)
>> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> at java.base/java.lang.Thread.run(Thread.java:829)
>> INFO [Repair-Task:437] 2023-08-04 11:56:28,594 RepairRunnable.java:223 - [repair #30675c00-32df-11ee-a7d8-05183c68b0d0]Repair command #522 finished with error
>>
>> What to do?
>> Thanks!
>>
>> -Joe
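For anyone following along: the "Validation failed in /<host>" errors earlier in the thread already name the failing table and node, so they can be pulled out mechanically before doing the per-table repair Manish suggests. A rough sketch (the sample error line is modeled on the output above; the `sed` expression and the repair loop are illustrative, and the table names in the loop are taken from elsewhere in this thread, not from a live schema):

```shell
# Sample error text modeled on the repair failure earlier in this thread.
err='failed with error [repair #154f5330-3304-11ee-939d-635768ac938c on doc/origdoc, [(5333449259855342357,5338449508113440752]]] Validation failed in /172.16.20.16:7000.'

# Pull out the keyspace/table and the failing participant.
echo "$err" | sed -n 's/.* on \([^,]*\),.*Validation failed in \/\([0-9.:]*[0-9]\).*/table=\1 host=\2/p'

# Per-table full repair to isolate the failing table
# (sketch only -- needs a live node; adjust table names to your schema):
#   for t in origdoc extractedmetadata source_correlations; do
#     nodetool repair --full doc "$t" || echo "FAILED: doc.$t"
#   done
```

Once the failing table is isolated, scrubbing its corrupt SSTable on the named host (172.16.20.16 in the output above) and re-running the full repair is the usual path forward.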