Hi,

I'm running a C* 2.1.8 cluster in two data centers with 6 nodes each. I've started running repair sequentially on each node (`nodetool repair --parallel --in-local-dc`).
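For reference, the sequential run looks roughly like this (the hostnames and the ssh access pattern are just illustrative; the script prints the commands instead of executing them, to show the sequence):

```shell
#!/bin/sh
# Rough sketch of running repair node by node (hostnames are illustrative).
repair_sequentially() {
  for node in "$@"; do
    # On the real cluster this would be:
    #   ssh "$node" nodetool repair --parallel --in-local-dc
    printf 'ssh %s nodetool repair --parallel --in-local-dc\n' "$node"
  done
}

repair_sequentially db1 db2 db3 db4 db5 db6
```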
While repair is running, the number of SSTables grows dramatically, as do pending compaction tasks. Normally that's fine, as the node usually recovers within a couple of hours after repair finishes (https://www.dropbox.com/s/xzcndf5596mq7rm/Screenshot%202015-09-26%2016.17.44.png?dl=0). One experiment showed that increasing compaction throughput and the number of compactors mitigates the problem.

Unfortunately, one node didn't recover (https://www.dropbox.com/s/nphnsaf2rbfm0bq/Screenshot%202015-09-26%2016.20.56.png?dl=0). I had to interrupt the repair because the node was running out of disk space. I hoped the node would catch up with compaction within a couple of hours, but it hasn't happened even after 5 days. I've tried increasing compaction throughput, disabling throttling, increasing the number of compactors, disabling binary / thrift / gossip, increasing the heap size, and restarting, but compaction is still extremely slow.

Today I tried running scrub:

```
root@db2:~# nodetool scrub sync
Aborted scrubbing atleast one column family in keyspace sync, check server logs for more information.
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)
```

as well as cleanup:

```
root@db2:~# nodetool cleanup
Aborted cleaning up atleast one column family in keyspace sync, check server logs for more information.
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)
```

I couldn't find anything in the logs about these runtime exceptions (full log here: https://www.dropbox.com/s/flmii7fgpyp07q2/db2.lati.system.log?dl=0). Note that I'm also hitting CASSANDRA-9935 while running repair on each node in the cluster.

Any help will be much appreciated.
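For completeness, here is roughly what I've been running on the stuck node (the function just prints the commands so the sequence is clear; on the real node I run them directly; values are examples, not recommendations):

```shell
#!/bin/sh
# Sketch of the mitigation steps tried on the stuck node.
# Note: in C* 2.1 the number of compactors is concurrent_compactors in
# cassandra.yaml, so raising it requires an edit plus a node restart;
# there is no nodetool command for it in this version.
mitigation_plan() {
  cat <<'EOF'
nodetool setcompactionthroughput 0
nodetool disablebinary
nodetool disablethrift
nodetool disablegossip
nodetool compactionstats
EOF
}

mitigation_plan
```

`setcompactionthroughput 0` disables compaction throttling entirely; the three `disable*` commands shed client and cluster load while compactions try to catch up, and `compactionstats` is what I watch for progress.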
-- BR, Michał Łowicki