I am running a 6-node cluster on Apache Cassandra 2.1.2 with DataStax OpsCenter 5.0.2, built from the AWS EC2 AMI "DataStax Auto-Clustering AMI 2.5.1-hvm" (the DataStax Community AMI). When I try to run a repair on the rollups60 column family in the OpsCenter keyspace, I get errors about failed snapshot creation in the Cassandra system log. The repair seems to keep going, and then finishes with errors.
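In case it helps, this is roughly how I am pulling those errors out of the log. I am assuming the log path the packaged install on this AMI uses (/var/log/cassandra/system.log); adjust it if your install logs somewhere else:

    # Show the snapshot failures with a little surrounding context
    # (log path assumed from the packaged/AMI install defaults).
    grep -B 2 -A 8 "Failed during snapshot creation" /var/log/cassandra/system.log

    # Count how many times the snapshot phase has failed so far.
    grep -c "Error occurred during snapshot phase" /var/log/cassandra/system.log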
I am wondering whether this is making the repair ineffectual. I am running the command nodetool repair OpsCenter rollups60 on one of the nodes (10.63.74.70), and I get this output from it:

    [2015-01-23 19:36:06,261] Starting repair command #9, repairing 511 ranges for keyspace OpsCenter (seq=true, full=true)
    [2015-01-23 21:08:16,242] Repair session 67772db0-a337-11e4-9e78-37e5027a626b for range (5848435723460298978,5868916338423419522] failed with error java.io.IOException: Failed during snapshot creation.

That error is repeated many times, and the failures all appear right at the end. Here is an example of what I see in the log on that same node (the one I am running the command from, and the one that is trying to snapshot):

    INFO [AntiEntropyStage:1] 2015-01-23 19:38:28,235 RepairSession.java:171 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Received merkle tree for rollups60 from /10.63.74.70
    INFO [AntiEntropySessions:9] 2015-01-23 19:38:28,236 RepairSession.java:260 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (5848435723460298978,5868916338423419522] for OpsCenter.[rollups60]
    INFO [RepairJobTask:3] 2015-01-23 19:38:28,237 Differencer.java:74 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Endpoints /10.13.157.190 and /10.63.74.70 have 1 range(s) out of sync for rollups60
    INFO [AntiEntropyStage:1] 2015-01-23 19:38:28,237 ColumnFamilyStore.java:840 - Enqueuing flush of rollups60: 465365 (0%) on-heap, 0 (0%) off-heap
    INFO [MemtableFlushWriter:25] 2015-01-23 19:38:28,238 Memtable.java:325 - Writing Memtable-rollups60@204861223(51960 serialized bytes, 1395 ops, 0%/0% of on/off-heap limit)
    INFO [RepairJobTask:3] 2015-01-23 19:38:28,239 StreamingRepairTask.java:68 - [streaming task #138b42e0-a337-11e4-9e78-37e5027a626b] Performing streaming repair of 1 ranges with /10.13.157.190
    INFO [MemtableFlushWriter:25] 2015-01-23 19:38:28,262 Memtable.java:364 - Completed flushing /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/OpsCenter-rollups60-ka-331933-Data.db (29998 bytes) for commitlog position ReplayPosition(segmentId=1422038939094, position=31047766)
    ERROR [RepairJobTask:2] 2015-01-23 19:38:39,067 RepairJob.java:127 - Error occurred during snapshot phase
    java.lang.RuntimeException: Could not create snapshot at /10.63.74.70
        at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:347) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    INFO [AntiEntropySessions:10] 2015-01-23 19:38:39,068 RepairSession.java:260 - [repair #6dec29c0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (-6918744323658665195,-6916171087863528821] for OpsCenter.[rollups60]
    ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,068 RepairSession.java:303 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] session completed with the following error
    java.io.IOException: Failed during snapshot creation.
        at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,070 CassandraDaemon.java:153 - Exception in thread Thread[AntiEntropySessions:9,5,RMI Runtime]
    java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    Caused by: java.io.IOException: Failed during snapshot creation.
        at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        ... 3 common frames omitted

Those errors are repeated many times in the log. The IP address 10.63.74.70 in the log is the node I am running the repair from. I am able to repair the rest of the OpsCenter column families, and they complete quickly and without error.

I have tried creating a snapshot myself, and it completes successfully with nothing logged (see the P.S. below for exactly what I ran to check it):

    nodetool snapshot OpsCenter

The disk has plenty of space left. Are these errors a problem? Should I just let the repair keep running for however long it takes? The cluster is currently not in use by any application, yet it does show some load while this repair is running, so it is not sitting idle (it has no load when I am not repairing).

Thanks for any help. And if this is the wrong place to ask about a DataStax Community issue, could someone point me in the right direction?

~ Paul Nickerson
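P.S. For completeness, this is roughly how I verified that a manual snapshot works on the affected node. The tag name is just something I made up, and the data directory comes from the Memtable flush line in the log above, so adjust both for your own layout:

    # Take a tagged snapshot of the OpsCenter keyspace (the tag name is arbitrary).
    nodetool snapshot -t repair-test OpsCenter

    # Confirm the snapshot is registered (listsnapshots should be available in 2.1, as I understand it).
    nodetool listsnapshots

    # The snapshot files also show up on disk under the table's snapshots directory.
    ls /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/snapshots/

    # Remove the test snapshot when done.
    nodetool clearsnapshot -t repair-test OpsCenter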