Hi,

I'm running a C* 2.1.8 cluster in two data centers with 6 nodes each. I've started running repair sequentially on each node (`nodetool repair --parallel --in-local-dc`).
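For reference, the sequential run looks roughly like this (the hostnames and the ssh access pattern are just illustrative; the script prints the commands instead of executing them, to show the sequence):

```shell
#!/bin/sh
# Rough sketch of running repair node by node (hostnames are illustrative).
repair_sequentially() {
  for node in "$@"; do
    # On the real cluster this would be:
    #   ssh "$node" nodetool repair --parallel --in-local-dc
    printf 'ssh %s nodetool repair --parallel --in-local-dc\n' "$node"
  done
}

repair_sequentially db1 db2 db3 db4 db5 db6
```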
While repair is running, the number of SSTables grows dramatically, as do pending compaction tasks. Normally that's fine, as the node usually recovers within a couple of hours after repair finishes (https://www.dropbox.com/s/xzcndf5596mq7rm/Screenshot%202015-09-26%2016.17.44.png?dl=0). One experiment showed that increasing compaction throughput and the number of compactors mitigates the problem.

Unfortunately, one node didn't recover (https://www.dropbox.com/s/nphnsaf2rbfm0bq/Screenshot%202015-09-26%2016.20.56.png?dl=0). I had to interrupt the repair because the node was running out of disk space. I hoped the node would catch up with compaction within a couple of hours, but it hasn't happened even after 5 days. I've tried increasing compaction throughput, disabling throttling, increasing the number of compactors, disabling binary / thrift / gossip, increasing the heap size, and restarting, but compaction is still extremely slow.

Today I tried running scrub:

```
root@db2:~# nodetool scrub sync
Aborted scrubbing atleast one column family in keyspace sync, check server logs for more information.
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)
```

as well as cleanup:

```
root@db2:~# nodetool cleanup
Aborted cleaning up atleast one column family in keyspace sync, check server logs for more information.
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)
```

I couldn't find anything in the logs about these runtime exceptions (full log here: https://www.dropbox.com/s/flmii7fgpyp07q2/db2.lati.system.log?dl=0). Note that I'm also hitting CASSANDRA-9935 while running repair on each node in the cluster.

Any help will be much appreciated.
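For completeness, here is roughly what I've been running on the stuck node (the function just prints the commands so the sequence is clear; on the real node I run them directly; values are examples, not recommendations):

```shell
#!/bin/sh
# Sketch of the mitigation steps tried on the stuck node.
# Note: in C* 2.1 the number of compactors is concurrent_compactors in
# cassandra.yaml, so raising it requires an edit plus a node restart;
# there is no nodetool command for it in this version.
mitigation_plan() {
  cat <<'EOF'
nodetool setcompactionthroughput 0
nodetool disablebinary
nodetool disablethrift
nodetool disablegossip
nodetool compactionstats
EOF
}

mitigation_plan
```

`setcompactionthroughput 0` disables compaction throttling entirely; the three `disable*` commands shed client and cluster load while compactions try to catch up, and `compactionstats` is what I watch for progress.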
-- BR, Michał Łowicki