You are right about the behavior of Cassandra compaction. Before purging tombstones, it checks whether the row key exists in any other SSTable files that are not in the compaction set.
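To make the rule concrete, here is a minimal sketch of the check as described above. It is a simplified model, not Cassandra's actual code: the function name, its parameters, and the use of plain Python sets to stand in for SSTable key lookups are all assumptions made for illustration.

```python
import time


def can_purge_tombstone(row_key, deleted_at, gc_grace_seconds,
                        sstables_outside_compaction, now=None):
    """Simplified model of the purge rule (hypothetical helper, not a Cassandra API).

    row_key: key of the row that holds the column tombstone
    deleted_at: deletion time of the tombstone, in seconds
    gc_grace_seconds: the column family's gc_grace setting
    sstables_outside_compaction: one set of row keys per SSTable that is
        NOT part of the current compaction (assumed representation)
    """
    now = time.time() if now is None else now

    # The tombstone must be older than gc_grace before it may be dropped.
    if now - deleted_at < gc_grace_seconds:
        return False

    # The row key must not appear in any SSTable outside the compaction set;
    # otherwise that SSTable might still hold data the tombstone shadows.
    return all(row_key not in keys for keys in sstables_outside_compaction)
```

In this model, an SSTable flushed while a long-running major compaction is in progress counts as "outside the compaction set". If it contains the same wide-row key, the function returns False for every tombstone in that row, which matches what you are seeing.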
I think https://issues.apache.org/jira/browse/CASSANDRA-4671 would help if you upgrade to the latest 1.2. In your version, I think the workaround is to stop writes so that no new SSTables are flushed, and then compact.

On Thu, May 16, 2013 at 3:07 AM, Boris Yen <yulin...@gmail.com> wrote:
> Hi All,
>
> Sorry for the wide distribution.
>
> Our Cassandra is running on 1.0.10. Recently, we have been facing a weird
> situation. We have a column family containing wide rows (each row might
> have a few million columns). We delete the columns on a daily basis and
> we also run major compaction on it every day to free up disk space (the
> gc_grace is set to 600 seconds).
>
> However, every time we run the major compaction, only 1 or 2GB of disk space
> is freed. We tried to delete most of the data before running compaction,
> however, the result is pretty much the same.
>
> So, we tried to check the source code. It seems that the column tombstones
> can only be purged when the row key is not in other sstables. I know the
> major compaction should include all sstables, however, in our use case,
> columns get inserted rapidly. This makes Cassandra flush the
> memtables to disk and create new sstables. The newly created sstables will
> have the same keys as the sstables that are being compacted (the compaction
> takes 2 or 3 hours to finish). My question is: could these newly
> created sstables be the reason why most of the column tombstones are not being
> purged?
>
> p.s. We also did some other tests. We inserted data into the same CF with the
> same wide-row pattern and deleted most of the data. This time we stopped
> all the writes to Cassandra and did the compaction. The disk usage
> decreased dramatically.
>
> Any suggestions, or is this a known issue?
>
> Thanks and Regards,
> Boris

--
Yuki Morishita
t:yukim (http://twitter.com/yukim)