Rhys Campbell created CASSANDRA-15109:
-----------------------------------------
Summary: nodetool repair failing with "Validation failed in
/10.222.5.44"
Key: CASSANDRA-15109
URL: https://issues.apache.org/jira/browse/CASSANDRA-15109
Project: Cassandra
Issue Type: Bug
Components: Tool/nodetool
Reporter: Rhys Campbell
*Cassandra Version:* 2.2.13
*Command*
{noformat}
nodetool -h 127.0.0.1 -p 7199 repair -pr -full{noformat}
*Sample Output*
{noformat}
May 3 13:26:13 xxxxxxx cassandra: ERROR 11:26:13 Failed creating a merkle tree
for [repair #8a6859c0-6d95-11e9-b769-5964d82f38b1 on ks/table,
(4812194106185100517,5213210281700525452]], /X.X.5.42 (see log for
details){noformat}
On the mentioned node we have the following info logged...
{noformat}
May 3 13:26:13 XXXXXXXX cassandra: ERROR 11:26:13 Failed creating a merkle
tree for [repair #8a6859c0-6d95-11e9-b769-5964d82f38b1 on ks/table,
(4812194106185100517,5213210281700525452]], /X.X.5.42 (see log for
details){noformat}
These are always (as seen so far) preceded by...
{noformat}
Apr 29 00:45:04 XXXXXXXX cassandra: INFO 22:45:04 InetAddress /X.X.5.42 is now
DOWN
Apr 29 00:45:09 XXXXXXXX cassandra: INFO 22:45:09 Handshaking version with
/10.223.5.42
Apr 29 00:45:09 XXXXXXXX cassandra: INFO 22:45:09 InetAddress /X.X.5.42 is now
UP{noformat}
and followed by a Java stack trace...
{noformat}
Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Exception in thread
Thread[ValidationExecutor:43,1,main]
Apr 29 00:45:10 XXXXXXXX cassandra: java.lang.RuntimeException: Parent repair
session with id = 8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:398)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1206)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1131)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
~[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at java.lang.Thread.run(Thread.java:748)
[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Writing
Memtable-compactions_in_progress@2106381056(0.156KiB serialized bytes, 9 ops,
0%/0% of on/off-heap limit)
Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Handshaking version with
/10.223.5.42
Apr 29 00:45:10 XXXXXXXX cassandra: INFO 22:45:10 Writing
Memtable-compactions_in_progress@134296463(0.008KiB serialized bytes, 1 ops,
0%/0% of on/off-heap limit)
Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Got error, removing parent
repair session
Apr 29 00:45:10 XXXXXXXX cassandra: ERROR 22:45:10 Exception in thread
Thread[AntiEntropyStage:1,5,main]
Apr 29 00:45:10 XXXXXXXX cassandra: java.lang.RuntimeException:
java.lang.RuntimeException: Parent repair session with id =
8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:183)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
~[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
~[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: at java.lang.Thread.run(Thread.java:748)
[na:1.8.0_172]
Apr 29 00:45:10 XXXXXXXX cassandra: Caused by: java.lang.RuntimeException:
Parent repair session with id = 8f9fe6c0-6a06-11e9-bd05-21e986c06e90 has failed.
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:398)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:432)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: at
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:155)
~[apache-cassandra-2.2.13.jar:2.2.13]
Apr 29 00:45:10 XXXXXXXX cassandra: ... 6 common frames omitted{noformat}
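To check whether the DOWN/UP events really do precede every "Parent repair session ... has failed" exception, the log could be scanned with something like the sketch below. It is only an illustration: it assumes each log entry sits on a single line in the syslog layout shown above (the excerpts in this report are wrapped by the mailer), the function name `flaps_before_failures`, the 60-second window and the hard-coded year are my own choices, and nothing here is part of Cassandra itself.

```python
import re
from datetime import datetime, timedelta

# Assumed layout (one entry per line), e.g.:
# "Apr 29 00:45:04 host cassandra: INFO 22:45:04 InetAddress /X.X.5.42 is now DOWN"
FLAP = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ cassandra: INFO .* "
    r"InetAddress (/\S+) is now (DOWN|UP)"
)
FAIL = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ cassandra: "
    r".*Parent repair session with id = (\S+) has failed"
)


def parse_ts(text, year=2019):
    # Syslog timestamps carry no year, so one is supplied (assumption: 2019).
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def flaps_before_failures(lines, window=timedelta(seconds=60)):
    """For each failed parent repair session, list the gossip DOWN/UP
    events logged within `window` before its first failure line."""
    flaps = []    # (timestamp, endpoint, state)
    seen = set()  # session ids already reported (the id repeats in the trace)
    report = []
    for line in lines:
        m = FLAP.match(line)
        if m:
            flaps.append((parse_ts(m.group(1)), m.group(2), m.group(3)))
            continue
        m = FAIL.match(line)
        if m and m.group(2) not in seen:
            seen.add(m.group(2))
            ts = parse_ts(m.group(1))
            recent = [(ep, st) for t, ep, st in flaps
                      if timedelta(0) <= ts - t <= window]
            report.append((m.group(2), recent))
    return report
```

A session that comes back with an empty list of recent events would be a counter-example to the "always preceded by" observation.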
I've tried a few combinations of options with the nodetool repair command. Here
are the results...
{noformat}
parallelism: parallel, primary range: true, incremental: false - NOK
parallelism: parallel, primary range: false, incremental: false - NOK
parallelism: parallel, primary range: false, incremental: false - NOK
parallelism: sequential, primary range: false, incremental: false - NOK
(Although here I get a different error: "failed with error Could not create
snapshot at /X.X.5.43 (progress: 60%)")
parallelism: parallel, primary range: false, incremental: true - OK
{noformat}
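For anyone trying to reproduce this, the combinations above can be turned back into command lines roughly as follows. This is a sketch under assumptions: in 2.2 I understand repair to be parallel and incremental by default, with `-full` selecting a full repair, `-seq` a sequential one and `-pr` the primary range only. The helper name `repair_command` is made up for illustration, and the host and port are just the ones from the command at the top of this report.

```python
def repair_command(parallel, primary_range, incremental,
                   host="127.0.0.1", port=7199):
    """Build a nodetool repair command line for one option combination.

    Assumes Cassandra 2.2 semantics: parallel and incremental are the
    defaults, so only the deviations need a flag.
    """
    cmd = ["nodetool", "-h", host, "-p", str(port), "repair"]
    if primary_range:
        cmd.append("-pr")
    if not incremental:
        cmd.append("-full")
    if not parallel:
        cmd.append("-seq")
    return cmd


# (parallel, primary_range, incremental) for the distinct rows listed above
COMBINATIONS = [
    (True, True, False),
    (True, False, False),
    (False, False, False),
    (True, False, True),
]

if __name__ == "__main__":
    for combo in COMBINATIONS:
        print(" ".join(repair_command(*combo)))
```

The first combination reproduces the command given at the top of this report.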
This only started happening relatively recently. There have been no changes to
our system, major or minor, that we think would cause this. It is happening on
every node in one DC and on a few nodes in the second. The "Failed creating a
merkle tree" error is present on every node, but most of the nodes in the
second DC seem to complete their repair.