[
https://issues.apache.org/jira/browse/CASSANDRA-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua McKenzie updated CASSANDRA-6797:
---------------------------------------
Component/s: Lifecycle, Compaction
> compaction and scrub data directories race on startup
> -----------------------------------------------------
>
> Key: CASSANDRA-6797
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
> Project: Cassandra
> Issue Type: Bug
> Components: Compaction, Lifecycle
> Environment: macos (and linux)
> Reporter: Matt Byrd
> Assignee: Joshua McKenzie
> Priority: Minor
> Labels: compaction, concurrency, starting
> Fix For: 2.0.6, 2.1 beta2
>
> Attachments: trunk-6797.patch
>
>
>
> Hi,
> While doing a rolling restart of a 2.0.5 cluster in several environments I'm
> seeing the following error:
> {code}
> INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java
> (line 115) Compacting
> [SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'),
> SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-15-Data.db'),
> SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'),
> SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-14-Data.db')]
> INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java
> (line 254) Initializing system_traces.sessions
> INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java
> (line 254) Initializing system_traces.events
> WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473)
> Removing orphans for
> /Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13:
> [CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db,
> Statistics.db]
> ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479)
> Exception encountered during startup
> java.lang.AssertionError: attempted to delete non-existing file
> system-local-jb-13-CompressionInfo.db
> at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
> at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
> at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
> INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java
> (line 275) Compacted 4 sstables to
> [/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,].
> 10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s. 4 total
> partitions merged to 1. Partition merge counts were {4:1, }
> {code}
> This looks like a race: compactions are running while the existing data
> directories are being scrubbed.
> An in-progress compaction probably looks like an incomplete one, so the scrub
> tries to remove its files while the compaction is still running.
> When scrubDataDirectories attempts the delete, the file no longer exists,
> presumably because it has already been compacted away.
> This triggers an assertion error and the node fails to start.
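The interleaving described above can be replayed deterministically outside Cassandra. The following is a minimal sketch, not Cassandra's code: every class, method, and file name in it is hypothetical. It snapshots a directory listing (the scrub's view), removes one file to stand in for a compaction finishing first, and then walks the stale snapshot with a delete that asserts the file exists. A tolerant delete is shown only for contrast; it is not one of the fixes suggested in this ticket, and nothing here says what the attached patch does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from Cassandra.
final class ScrubRaceSketch {

    // Strict delete: treats a missing file as a programming error.
    static void strictDelete(Path file) throws IOException {
        if (!Files.exists(file))
            throw new AssertionError("attempted to delete non-existing file " + file.getFileName());
        Files.delete(file);
    }

    // Tolerant delete: returns false if the file was already gone.
    static boolean tolerantDelete(Path file) throws IOException {
        return Files.deleteIfExists(file);
    }

    // Creates two files, snapshots the listing (the scrub's view), then
    // removes one file to stand in for a compaction finishing first.
    private static List<Path> staleSnapshot(Path dir) throws IOException {
        Path data = Files.createFile(dir.resolve("demo-jb-13-Data.db"));
        Files.createFile(dir.resolve("demo-jb-13-Index.db"));
        List<Path> snapshot;
        try (Stream<Path> entries = Files.list(dir)) {
            snapshot = entries.sorted().collect(Collectors.toList());
        }
        Files.delete(data);
        return snapshot;
    }

    // Removes whatever is left in the temporary directory, then the directory.
    private static void cleanup(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            for (Path p : entries.collect(Collectors.toList())) Files.deleteIfExists(p);
        }
        Files.deleteIfExists(dir);
    }

    // Returns true if walking the stale snapshot with strictDelete hit the assertion.
    static boolean strictScrubFails() {
        try {
            Path dir = Files.createTempDirectory("scrub-race");
            try {
                for (Path p : staleSnapshot(dir)) strictDelete(p);
                return false;
            } catch (AssertionError e) {
                return true;
            } finally {
                cleanup(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns how many files the tolerant walk actually removed.
    static int tolerantScrubDeletes() {
        try {
            Path dir = Files.createTempDirectory("scrub-race");
            try {
                int deleted = 0;
                for (Path p : staleSnapshot(dir)) if (tolerantDelete(p)) deleted++;
                return deleted;
            } finally {
                cleanup(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("strict scrub failed: " + strictScrubFails());
        System.out.println("tolerant scrub deleted: " + tolerantScrubDeletes());
    }
}
```

The sketch only illustrates why a stale listing combined with an existence assertion fails. Whether tolerating a missing file would be acceptable in the real scrub path is a separate question that the ticket does not settle.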
> Here is a ccm script which just stops and starts a 3-node 2.0.5 cluster
> repeatedly.
> It reproduces the problem fairly reliably, in fewer than ten
> iterations:
> {code}
> #!/bin/bash
> ccm create compaction_race -v 2.0.5
> ccm populate -n 3
> ccm start
> for i in $(seq 0 1000); do
>     echo "$i"
>     ccm stop
>     ccm start
>     grep ERR ~/.ccm/compaction_race/*/logs/system.log
> done
> {code}
>
> Someone else should probably confirm that this is what is going wrong.
> If it is, the fix might be as simple as disabling autocompaction
> slightly earlier in CassandraDaemon.setup.
>
> Alternatively, unless there is a good reason to scrub the system tables
> first and then scrub all keyspaces (including the system keyspace), the
> second scrub could cover only the non-system keyspaces.
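The second suggestion amounts to scrubbing the system keyspace once, up front, and leaving it out of the later pass. A minimal sketch of that ordering follows. The class and method names are hypothetical, keyspaces are modelled as plain strings, and nothing here reflects how CassandraDaemon.setup actually enumerates keyspaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed scrub ordering; not Cassandra's API.
final class ScrubOrderSketch {
    static final String SYSTEM_KEYSPACE = "system";

    // Keyspaces for the second pass: everything except the system keyspace.
    static List<String> secondPassKeyspaces(List<String> all) {
        List<String> out = new ArrayList<>();
        for (String ks : all)
            if (!SYSTEM_KEYSPACE.equals(ks)) out.add(ks);
        return out;
    }

    // Full order: the system keyspace first (if present), then the rest,
    // so that no keyspace is scrubbed twice.
    static List<String> scrubOrder(List<String> all) {
        List<String> order = new ArrayList<>();
        if (all.contains(SYSTEM_KEYSPACE)) order.add(SYSTEM_KEYSPACE);
        order.addAll(secondPassKeyspaces(all));
        return order;
    }
}
```

Note that only the keyspace named "system" is excluded from the second pass. Other keyspaces, such as the system_traces keyspace seen in the log above, would still be scrubbed there.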
> Please let me know if there is anything else I can provide.
> Thanks,
> Matt