I am running a six-node cluster with Apache Cassandra 2.1.2 and DataStax
OpsCenter 5.0.2, built from the AWS EC2 AMI "DataStax Auto-Clustering AMI
2.5.1-hvm" (DataStax Community AMI). When I try to run a repair on the
rollups60 column family in the OpsCenter keyspace, the Cassandra system log
reports errors about failed snapshot creation. The repair seems to keep
going, and then finishes with errors.

I am wondering whether these failures make the repair ineffective.

I am running the command

    nodetool repair OpsCenter rollups60

on one of the nodes (10.63.74.70), and the command produces this output:

    [2015-01-23 19:36:06,261] Starting repair command #9, repairing 511 ranges for keyspace OpsCenter (seq=true, full=true)
    [2015-01-23 21:08:16,242] Repair session 67772db0-a337-11e4-9e78-37e5027a626b for range (5848435723460298978,5868916338423419522] failed with error java.io.IOException: Failed during snapshot creation.

The error is repeated many times, and the failures all appear right at the
end. Here is an example of what I see in the system log on that same node
(the one I'm running the command from, and the one that's trying to snapshot):

    INFO  [AntiEntropyStage:1] 2015-01-23 19:38:28,235 RepairSession.java:171 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Received merkle tree for rollups60 from /10.63.74.70
    INFO  [AntiEntropySessions:9] 2015-01-23 19:38:28,236 RepairSession.java:260 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (5848435723460298978,5868916338423419522] for OpsCenter.[rollups60]
    INFO  [RepairJobTask:3] 2015-01-23 19:38:28,237 Differencer.java:74 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Endpoints /10.13.157.190 and /10.63.74.70 have 1 range(s) out of sync for rollups60
    INFO  [AntiEntropyStage:1] 2015-01-23 19:38:28,237 ColumnFamilyStore.java:840 - Enqueuing flush of rollups60: 465365 (0%) on-heap, 0 (0%) off-heap
    INFO  [MemtableFlushWriter:25] 2015-01-23 19:38:28,238 Memtable.java:325 - Writing Memtable-rollups60@204861223(51960 serialized bytes, 1395 ops, 0%/0% of on/off-heap limit)
    INFO  [RepairJobTask:3] 2015-01-23 19:38:28,239 StreamingRepairTask.java:68 - [streaming task #138b42e0-a337-11e4-9e78-37e5027a626b] Performing streaming repair of 1 ranges with /10.13.157.190
    INFO  [MemtableFlushWriter:25] 2015-01-23 19:38:28,262 Memtable.java:364 - Completed flushing /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/OpsCenter-rollups60-ka-331933-Data.db (29998 bytes) for commitlog position ReplayPosition(segmentId=1422038939094, position=31047766)
    ERROR [RepairJobTask:2] 2015-01-23 19:38:39,067 RepairJob.java:127 - Error occurred during snapshot phase
    java.lang.RuntimeException: Could not create snapshot at /10.63.74.70
            at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:347) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
            at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
            at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    INFO  [AntiEntropySessions:10] 2015-01-23 19:38:39,068 RepairSession.java:260 - [repair #6dec29c0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (-6918744323658665195,-6916171087863528821] for OpsCenter.[rollups60]
    ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,068 RepairSession.java:303 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] session completed with the following error
    java.io.IOException: Failed during snapshot creation.
            at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
            at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,070 CassandraDaemon.java:153 - Exception in thread Thread[AntiEntropySessions:9,5,RMI Runtime]
    java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
            at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
            at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
            at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
            at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
    Caused by: java.io.IOException: Failed during snapshot creation.
            at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
            at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
            ... 3 common frames omitted

The errors are repeated many times. The IP address 10.63.74.70 in the log
belongs to the node I'm running the repair from. I am able to repair the
rest of the OpsCenter column families, and they complete fairly quickly
without error.
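
For example, repairing another column family is just the same command with a
different table name (rollups300 here is simply one of the other OpsCenter
column families; I run the equivalent command for each of them):

    nodetool repair OpsCenter rollups300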

I have tried creating a snapshot manually, and it completes successfully
with nothing logged:

    nodetool snapshot OpsCenter
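
In case it is useful, this is roughly how I verify that the manual snapshot
actually lands on disk. The data directory path below is copied from the
"Completed flushing" line in the log above; the snapshot files should appear
under its snapshots/ subdirectory:

    # list snapshots stored under the rollups60 data directory
    # (path taken from the flush message earlier in the system log)
    ls /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/snapshots/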

The disk has plenty of free space. Are these errors problematic? Should I
just let the repair process run for however long it takes? The cluster is
not currently in use by any application, but it does show some load while
this repair is running, so it is not sitting idle (it has no load when I'm
not repairing).
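
For what it's worth, the load I'm describing is visible while the repair
runs; a simple way I check it (not shown in the logs above) is:

    # per-thread-pool activity; the anti-entropy / validation pools show
    # active and pending tasks while the repair is running
    nodetool tpstats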

Thanks for any help.

And if this is the wrong place to ask about a DataStax Community question,
could someone point me in the right direction?

 ~ Paul Nickerson
