Re: Repair errors

2023-08-11 Thread Surbhi Gupta
Try sstablescrub on the node that is showing the corrupted data.
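
Note that sstablescrub is an offline tool, so stop Cassandra on that node first; nodetool scrub is the online alternative. Assuming the problem table is doc/origdoc (the one named in the validation failure earlier in this thread), something like:

sstablescrub --skip-corrupted doc origdoc

should rewrite the sstables and skip any unreadable partitions.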

On Fri, Aug 11, 2023 at 8:38 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Finally found a message on another node that seems relevant:
> [compaction log snipped]

Re: Repair errors

2023-08-11 Thread Joe Obernberger

Finally found a message on another node that seems relevant:

INFO  [CompactionExecutor:7413] 2023-08-11 11:36:22,397 
CompactionTask.java:164 - Compacting 
(d30b64ba-385c-11ee-8e74-edf5512ad115) 
[/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-97958-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-91664-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90239-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-99385-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-101078-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-86112-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90753-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-5-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-94008-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-92338-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-87273-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-82398-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-94244-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-80384-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-65431-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90412-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-90104-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-85155-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-92914-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-78344-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-53269-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-99242-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-73898-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-100473-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-76035-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-101352-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-62093-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-93643-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-97812-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-73062-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-65491-big-Data.db:level=0, 
/data/3/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-93299-big-Data.db:level=0, 
]
DEBUG [CompactionExecutor:7412] 2023-08-11 11:36:22,398 
Directories.java:502 - DataDirectory /data/7/cassandra/data has 
9194752 bytes available, checking if we can write 10716461 bytes
INFO  [CompactionExecutor:7412] 2023-08-11 11:36:22,398 
CompactionTask.java:164 - Compacting 
(d30b64b0-385c-11ee-8e74-edf5512ad115) 
[/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-36867-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32270-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32287-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-30785-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-32545-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38791-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-38586-big-Data.db:level=0, 
/data/3/cassandra/data/doc/source_correlations-4ce2d9f0912b11edbd6d4d9b3bfd78b2/nb-36849-big-Data.db:level=0, 

Re: Repair errors

2023-08-07 Thread manish khandelwal
What do the logs on /172.16.20.16:7000 say when the repair failed? The
error indicates "validation failed". Can you check system.log on
/172.16.20.16:7000 and see what it says? It looks like you have some issue
with doc/origdoc, probably a corrupt sstable. Try running repair for each
individual table and see for which table the repair fails.
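
For example, assuming doc is the keyspace (table names taken from the errors
and logs in this thread):

nodetool repair --full doc origdoc
nodetool repair --full doc extractedmetadata
nodetool repair --full doc source_correlations

If only the origdoc run fails, that points at a corrupt sstable in that table.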

Regards
Manish

On Mon, Aug 7, 2023 at 11:39 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you.  I've tried:
> nodetool repair --full
> nodetool repair -pr
> They all get to 57% on any of the nodes and then fail.
> [repair errors, stack traces, and token ranges snipped]

Re: Repair errors

2023-08-07 Thread Joe Obernberger

Thank you.  I've tried:
nodetool repair --full
nodetool repair -pr
They all get to 57% on any of the nodes and then fail. Interestingly, 
the debug log only has INFO entries - there are no errors.


[2023-08-07 14:02:09,828] Repair command #6 failed with error 
Incremental repair session 83dc17d0-354c-11ee-809c-177460b0ed52 has failed

[2023-08-07 14:02:09,830] Repair command #6 finished with error
error: Repair job has failed with the error message: Repair command #6 
failed with error Incremental repair session 
83dc17d0-354c-11ee-809c-177460b0ed52 has failed. Check the logs on the 
repair participants for further details

-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error 
message: Repair command #6 failed with error Incremental repair session 
83dc17d0-354c-11ee-809c-177460b0ed52 has failed. Check the logs on the 
repair participants for further details
    at 
org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
    at 
org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
    at 
java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
    at 
java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
    at 
java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
    at 
java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)

    at java.base/java.lang.Thread.run(Thread.java:829)

Full repair results on another node:


[2023-08-04 20:21:42,575] Repair session 
14830280-3304-11ee-939d-635768ac938c for range 
[(-5756366402057257951,-5754159509763216479], 
(-2469484655657848961,-2461953651636879320], 
(-5175468354897450191,-5171107677178073434], 
(-628587988891618162,-624346074440106568], 
(-6615381309032691143,-6603240846496048854], 
(6616005974054228159,6628798414170514490], 
(8013321283688199900,8017115978405113835], 
(-7829682363035100161,-782466028871477], 
(2848484090138352114,2852114415040125826], 
(-2477015659678818602,-2469484655657848961], 
(-2483470805982506865,-2477015659678818602]] finished (progress: 57%)
[2023-08-04 20:36:23,786] Repair session 
14cbcb50-3304-11ee-939d-635768ac938c for range 
[(5193761311910499374,5197212898580538329], 
(-1679246469353274066,-1672836360726470435], 
(-6927245454058012407,-6922951496140109663], 
(1851771008808005661,1854683726231521039], 
(5197212898580538329,5200664485250577285], 
(1848858291384490283,1851771008808005661], 
(-4736378492502250338,-4732073287189625685], 
(-2705389975640427939,-2699099608948332293], 
(-7806270378003956741,-7796905583991499373], 
(466064862768270626,473304202405656261], 
(250549667892224144,253421473349298265], 
(-6922951496140109663,-6920804517181158291], 
(249113765163687083,250549667892224144], 
(1854683726231521039,1857596443655036418], 
(4687110928509362134,4694325991399541085], 
(-6920804517181158291,-691865753806919], 
(4399045818626652943,4402968741621424236], 
(473304202405656261,480543542043041896]] finished (progress: 57%)

[2023-08-04 20:36:23,795] Repair command #12 finished with error
error: Repair job has failed with the error message: Repair command #12 
failed with error Repair session 154f5330-3304-11ee-939d-635768ac938c 
for range [(5333449259855342357,5338449508113440752], 
(4959134492108085445,4965331080956982133], 
(5938148666505886222,5945280202710590417], 
(8428867157147807368,8431880058869458408], 
(5338449508113440752,5343449756371539147]] failed with error [repair 
#154f5330-3304-11ee-939d-635768ac938c on doc/origdoc, 
[(5333449259855342357,5338449508113440752], 
(4959134492108085445,4965331080956982133], 
(5938148666505886222,5945280202710590417], 
(8428867157147807368,8431880058869458408], 
(5338449508113440752,5343449756371539147]]] Validation failed in 
/172.16.20.16:7000. Check the logs on the repair participants for 
further details

-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error 
message: Repair command #12 failed with error Repair session 
154f5330-3304-11ee-939d-635768ac938c for range 
[(5333449259855342357,5338449508113440752], 
(4959134492108085445,4965331080956982133], 
(5938148666505886222,5945280202710590417], 
(8428867157147807368,8431880058869458408], 
(5338449508113440752,5343449756371539147]] failed with error [repair 
#154f5330-3304-11ee-939d-635768ac938c on doc/origdoc, 
[(5333449259855342357,5338449508113440752], 
(4959134492108085445,4965331080956982133], 
(5938148666505886222,5945280202710590417], 
(8428867157147807368,8431880058869458408], 
(5338449508113440752,5343449756371539147]]] Validation failed in 
/172.16.20.16:7000. Check the logs on the repair participants for 
further details
    at 

Re: Repair errors

2023-08-06 Thread Josh McKenzie
Quick drive-by observation:
> Did not get replies from all endpoints.. Check the 
> logs on the repair participants for further details

> dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The
> channel this output stream was writing to has been closed

> Caused by: io.netty.channel.unix.Errors$NativeIoException:
> writeAddress(..) failed: Connection timed out

> java.lang.RuntimeException: Did not get replies from all endpoints.
These all point to the same shaped problem: for whatever reason, the 
coordinator of this repair didn't receive replies from the replicas executing 
it. Could be that they're dead, could be they took too long, could be they 
never got the start message, etc. Distributed operations are tricky like that.

Logs on the replicas doing the actual repairs should give you more insight; 
this is a pretty low level generic set of errors that basically amounts to "we 
didn't hear back from the other participants in time so we timed out."
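
If it helps: assuming default log locations, grep each repair participant for
the session id from the coordinator's error (whatever uuid appears in the
"Repair session ... failed" line) to surface the underlying exception, e.g.:

grep '<repair-session-id>' /var/log/cassandra/system.log

where <repair-session-id> is a placeholder for that uuid.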

On Fri, Aug 4, 2023, at 12:02 PM, Surbhi Gupta wrote:
> Can you please try running nodetool describecluster from every node of the 
> cluster?
> 
> One time I noticed an issue where nodetool status showed all nodes UN but 
> describecluster did not agree.
> 
> Thanks
> Surbhi
> 
> On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger  
> wrote:
>> Hi All - I've been using Reaper to do repairs, but it has hung.  I tried to run:
>> nodetool repair -pr
>> on each of the nodes, but they all fail with some form of this error:
>> [error, nodetool status output, and debug log snipped]

Re: Repair errors

2023-08-04 Thread Surbhi Gupta
Can you please try running nodetool describecluster from every node of the
cluster?

One time I noticed an issue where nodetool status showed all nodes UN but
describecluster did not agree.
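
For example, assuming JMX is reachable remotely (otherwise run it locally on
each node):

for h in 172.16.100.34 172.16.100.35 172.16.20.16; do nodetool -h "$h" describecluster; done

All nodes should report the same schema version; a node stuck on a different
schema version can fail repairs even though nodetool status shows it UN.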

Thanks
Surbhi

On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger 
wrote:

> Hi All - I've been using Reaper to do repairs, but it has hung.  I tried to run:
> nodetool repair -pr
> on each of the nodes, but they all fail with some form of this error:
>
> error: Repair job has failed with the error message: Repair command #521
> failed with error Did not get replies from all endpoints.. Check the
> logs on the repair participants for further details
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error
> message: Repair command #521 failed with error Did not get replies from
> all endpoints.. Check the logs on the repair participants for further
> details
>  at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>  at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>  at
> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>  at
> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>  at
> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>  at
> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>  at java.base/java.lang.Thread.run(Thread.java:829)
>
> Using version 4.1.2-1
> nodetool status
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load        Tokens  Owns  Host ID                               Rack
> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>
> Note: Non-system keyspaces don't have the same replication settings,
> effective ownership information is meaningless
>
> Debug log has:
>
>
> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955
> MigrationCoordinator.java:264 - Pulling unreceived schema versions...
> INFO  [HintsDispatcher:11344] 2023-08-04 11:56:21,369
> HintsDispatchExecutor.java:318 - Finished hinted handoff of file
> 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint
> /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
> WARN
> [Messaging-OUT-/172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES]
> 2023-08-04 11:56:21,916 OutboundConnection.java:491 -
> /172.16.100.34:7000->/172.16.20.16:7000-LARGE_MESSAGES-[no-channel]
> dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The
> channel this output stream was writing to has been closed
>  at
> org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>  at
> org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>  at
> org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>  at
> org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>  at
> org.apache.cassandra.net.AsyncMessageOutputPlus.doFlush(AsyncMessageOutputPlus.java:100)
>  at
> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:122)
>  at
> org.apache.cassandra.hints.HintMessage$Serializer.serialize(HintMessage.java:139)
>  at
>
>