[ https://issues.apache.org/jira/browse/CASSANDRA-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914895#comment-13914895 ]
Robert Coli commented on CASSANDRA-6756:
----------------------------------------

The behavior of Cassandra loading data files which happen to be in the data dir is useful in various cases, but most of those cases would be addressed fine with a safe version of "refresh."

As a configurable option (with a log about the unexpected files?) this ticket seems reasonable as a protection against unintentional data gain... except that leaving these files in the data dir in a non-live state makes them susceptible to being silently overwritten by Cassandra. CASSANDRA-6719 (CASSANDRA-6245 / CASSANDRA-6514) is about a certain case where non-live files end up in the data directory, but this ticket suggests that there is a more general issue.

I would probably be fine if, given the proposed option, the non-live SSTables were moved to a "lost+found" directory so that they are protected from being silently overwritten by flush.

The simplest solution to preventing silent overwriting of accidentally dead SSTables in the data directory would seem to be to check for the existence of a file with a given name at flush time, and to increment the sequence until such a file does not exist.

Should I file a JIRA for either or both of:

1) move orphan SSTables to lost+found directory on startup? (or might this be that ticket?)
2) check for existence of SSTables before flushing?

> Provide option to avoid loading orphan SSTables on startup
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-6756
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6756
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vincent Mallet
>             Fix For: 1.2.16
>
>
> When Cassandra starts up, it enumerates all SSTables on disk for a known
> column family and proceeds to loading all of them, even those that were left
> behind before the restart because of a problem of some sort. This can lead to
> "data gain" (resurrected data) which is just as bad as data loss.
> The ask is to provide a yaml config option which would allow one to turn that
> behavior off by default so a cassandra cluster would be immune to data gain
> when nodes get restarted (at least with Leveled where Cassandra keeps track
> of SSTables).
> This is sort of a follow-up to CASSANDRA-6503 (fixed in 1.2.14). We're just
> extremely nervous that orphan SSTables could appear because of some other
> potential problem somewhere else and cause zombie data on a random reboot.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
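
For illustration only, the two ideas floated in the comment (moving orphan SSTables to "lost+found" on startup, and bumping the sequence at flush time until the file name is free) could look roughly like the sketch below. This is not Cassandra code. The file naming "<table>-<generation>-Data.db", the function names next_free_generation and quarantine_orphans, and the live_generations set are all assumptions made up for the example; real SSTables have more name components and several component files each.

```python
import re
import shutil
from pathlib import Path
from typing import List, Set

# Hypothetical, simplified naming scheme: "<table>-<generation>-Data.db".
# Real Cassandra SSTable names carry more components than this.
SSTABLE_NAME = re.compile(r"^(?P<table>.+)-(?P<generation>\d+)-Data\.db$")


def next_free_generation(data_dir: Path, table: str, candidate: int) -> int:
    """Return the first generation >= candidate with no Data file on disk."""
    while (data_dir / f"{table}-{candidate}-Data.db").exists():
        candidate += 1
    return candidate


def quarantine_orphans(data_dir: Path, live_generations: Set[int]) -> List[Path]:
    """Move Data files whose generation is not known to be live into lost+found."""
    lost_found = data_dir / "lost+found"
    moved = []
    # sorted() materializes the listing before lost+found is created.
    for path in sorted(data_dir.iterdir()):
        match = SSTABLE_NAME.match(path.name)
        if match is None or not path.is_file():
            continue  # not a Data file under the assumed naming scheme
        if int(match.group("generation")) in live_generations:
            continue  # tracked as live, leave it in place
        lost_found.mkdir(exist_ok=True)
        target = lost_found / path.name
        if target.exists():
            continue  # never overwrite a file that is already quarantined
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

With files for generations 1, 2 and 3 on disk and only 1 and 3 tracked as live, quarantine_orphans would move generation 2 aside, and a flush that wanted generation 2 before the move would have been pushed to generation 4 by next_free_generation rather than overwriting anything.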