Matt Byrd created CASSANDRA-6797:
------------------------------------
Summary: compaction and scrub data directories race on startup
Key: CASSANDRA-6797
URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: macos (and linux)
Reporter: Matt Byrd
Priority: Minor
Hi,
While doing a rolling restart of a 2.0.5 cluster in several environments, I'm
seeing the following error:
{code}
INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java (line 115) Compacting [SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-15-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-14-Data.db')]
INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java (line 254) Initializing system_traces.sessions
INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java (line 254) Initializing system_traces.events
WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473) Removing orphans for /Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13: [CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db, Statistics.db]
ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479) Exception encountered during startup
java.lang.AssertionError: attempted to delete non-existing file system-local-jb-13-CompressionInfo.db
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
    at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java (line 275) Compacted 4 sstables to [/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,]. 10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s. 4 total partitions merged to 1. Partition merge counts were {4:1, }
{code}
This looks like a race, since compactions are running whilst the existing
data directories are being scrubbed.
Probably an in-progress compaction looks like an incomplete one, so the scrub
tries to remove its sstables whilst the compaction is still running.
When scrubDataDirectories then attempts the delete, the file no longer
exists, presumably because it has been compacted away in the meantime.
This causes an assertion error and the node fails to start up.
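To make the suspected interleaving concrete, here is a minimal Java sketch that replays it in a fixed order on a temporary directory. It is not Cassandra's code: strictDelete is a hypothetical stand-in for a delete that asserts the file exists (modelled only on the message in the stack trace above), and the single-threaded replay is an assumption made so the outcome is deterministic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScrubRaceSketch {
    // Hypothetical stand-in for a strict delete: fails loudly if the file is already gone.
    static void strictDelete(Path file) throws IOException {
        if (!Files.exists(file)) {
            throw new AssertionError("attempted to delete non-existing file " + file.getFileName());
        }
        Files.delete(file);
    }

    // Replays the suspected interleaving in a fixed order.
    // Returns true if the strict delete failed because the file had vanished.
    static boolean replayInterleaving() {
        try {
            Path dir = Files.createTempDirectory("scrub-race");
            Path component = Files.createFile(dir.resolve("system-local-jb-13-CompressionInfo.db"));
            try {
                // Step 1: the scrub lists the component as an orphan (it still exists here).
                Path orphan = component;
                // Step 2: the compaction finishes and removes its input sstable component.
                Files.delete(component);
                // Step 3: the scrub acts on its now-stale listing.
                try {
                    strictDelete(orphan);
                    return false;
                } catch (AssertionError expected) {
                    return true;
                }
            } finally {
                // Clean up the temporary files regardless of the outcome.
                Files.deleteIfExists(component);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("strict delete failed: " + replayInterleaving());
    }
}
```

In the real startup the two actors would be the main thread and a CompactionExecutor thread, so whether the failure shows up depends on timing, which would fit it only appearing on some restarts.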
Here is a ccm script which just stops and starts a 3-node 2.0.5 cluster
repeatedly.
It seems to reproduce the problem fairly reliably, in fewer than ten iterations:
{code}
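#!/bin/bash
# Create and start a 3-node 2.0.5 cluster named compaction_race.
ccm create compaction_race -v 2.0.5
ccm populate -n 3
ccm start
# Restart the whole cluster repeatedly and print any errors logged so far.
for i in $(seq 0 1000); do
    echo "$i"
    ccm stop
    ccm start
    grep ERR ~/.ccm/compaction_race/*/logs/system.log
done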
{code}
Someone else should probably confirm that this is what is going wrong.
If it is, the solution might be as simple as disabling autocompactions
slightly earlier in CassandraDaemon.setup.
Alternatively, if there isn't a good reason why we first scrub the system
tables and then scrub all keyspaces (including the system keyspace again),
the second scrub could perhaps cover only the non-system keyspaces.
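As a rough illustration of the second alternative, the sketch below filters the system keyspace out of the list a second scrub pass would visit. Everything in it is hypothetical: keyspacesForSecondScrub and SYSTEM_KEYSPACE are names invented for this example, the keyspace name "system" is inferred from the data/system/... paths in the log, and "ks1" is a made-up user keyspace. It says nothing about how CassandraDaemon.setup actually enumerates keyspaces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SecondScrubSketch {
    // Assumed name of the system keyspace, inferred from the data/system/... paths in the log.
    static final String SYSTEM_KEYSPACE = "system";

    // Returns the keyspaces a second scrub pass would visit,
    // skipping the system keyspace that the first pass already handled.
    static List<String> keyspacesForSecondScrub(List<String> allKeyspaces) {
        return allKeyspaces.stream()
                .filter(ks -> !ks.equals(SYSTEM_KEYSPACE))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "ks1" is a hypothetical user keyspace used only for illustration.
        List<String> all = Arrays.asList("system", "system_traces", "ks1");
        System.out.println(keyspacesForSecondScrub(all));
    }
}
```

This would only avoid the double scrub of the system keyspace; whether it is enough depends on whether compactions on other keyspaces can also start before the second pass.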
Please let me know if there is anything else I can provide.
Thanks,
Matt