[ https://issues.apache.org/jira/browse/CASSANDRA-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Boudreault resolved CASSANDRA-10510.
-----------------------------------------
Resolution: Won't Fix
> Compacted SSTables failing to get removed, overflowing disk
> -----------------------------------------------------------
>
> Key: CASSANDRA-10510
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10510
> Project: Cassandra
> Issue Type: Bug
> Reporter: Björn Hegerfors
> Attachments: nonReleasedSSTables.txt
>
>
> Short version: it appears that if the resulting SSTable of a compaction
> enters another compaction soon after, the SSTables participating in the
> former compaction don't get deleted from disk until Cassandra is restarted.
> We have run into a big problem after applying CASSANDRA-10276 and
> CASSANDRA-10280, backported to 2.0.14. The bug we're seeing was not
> introduced by these patches; they have just made it very apparent and
> harmful.
> Here's what has happened. We had repair running on our table that is a time
> series and uses DTCS. The ring was split into 5016 small ranges being
> repaired one after the other (using parallel repair, i.e. not snapshot
> repair). This causes a flood of tiny SSTables to get streamed into all nodes
> (we don't use vnodes), with timestamp ranges similar to existing SSTables on
> disk. The problem with that is the sheer number of SSTables; disk usage
> itself is not affected. This has been reported before; see CASSANDRA-9644.
> These SSTables are streamed in continuously for up to a couple of days.
> The patches were applied to fix the problem of ending up with tens of
> thousands of SSTables that would never get touched by DTCS. But now that DTCS
> does touch them, we have run into a new problem instead. While disk usage was
> in the 25-30% neighborhood before repairs began, it started growing fast once
> these continuous streams began coming in. Eventually, a couple of nodes ran
> out of disk, which led us to stop all repairing on the cluster. This didn't
> reduce the disk usage. Compactions were, of course, very active.
> More than doubling disk usage should not be possible, regardless of the
> choices your compaction strategy makes. And we were not getting huge amounts
> of data streamed in. Large quantities of SSTables, yes, but the extra disk
> usage came from the nodes creating data out of thin air.
> We have a tool to show timestamp and size metadata of SSTables. What we
> found, looking at all non-tmp data files, was something akin to duplicates of
> almost all the largest SSTables. Not quite exact replicas, but there were
> these multi-gigabyte SSTables covering exactly the same range of timestamps
> and differing in size by mere kilobytes. There were typically 3 of each of
> the largest SSTables, sometimes even more.
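> (Not our actual tool, but a hypothetical, minimal sketch of the kind of check
> involved, shown here for clarity. It assumes the sstablemetadata tool is on
> the PATH and that its output contains "Minimum timestamp:" / "Maximum
> timestamp:" lines, which may differ between versions; the class name is made
> up. It groups non-tmp *-Data.db files by their timestamp range and prints any
> range covered by more than one SSTable.)
> {code:java}
> import java.io.BufferedReader;
> import java.io.File;
> import java.io.InputStreamReader;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> import java.util.TreeMap;
>
> // Hypothetical sketch: flag SSTables whose (min, max) timestamp ranges are
> // identical, i.e. likely leftovers of the same compaction lineage.
> public class DuplicateTimestampRanges
> {
>     public static void main(String[] args) throws Exception
>     {
>         File dir = new File(args.length > 0 ? args[0] : ".");
>         File[] files = dir.listFiles();
>         if (files == null)
>             return;
>         Map<String, List<String>> byRange = new TreeMap<String, List<String>>();
>         for (File f : files)
>         {
>             String name = f.getName();
>             if (!name.endsWith("-Data.db") || name.contains("tmp"))
>                 continue; // only finished, non-temporary data files
>             long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
>             // Assumption: sstablemetadata prints the min/max cell timestamps.
>             Process p = new ProcessBuilder("sstablemetadata", f.getAbsolutePath())
>                         .redirectErrorStream(true).start();
>             BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
>             String line;
>             while ((line = r.readLine()) != null)
>             {
>                 if (line.startsWith("Minimum timestamp"))
>                     min = Long.parseLong(line.replaceAll("[^0-9]", ""));
>                 else if (line.startsWith("Maximum timestamp"))
>                     max = Long.parseLong(line.replaceAll("[^0-9]", ""));
>             }
>             p.waitFor();
>             String range = min + " .. " + max;
>             List<String> sstables = byRange.get(range);
>             if (sstables == null)
>                 byRange.put(range, sstables = new ArrayList<String>());
>             sstables.add(name + " (" + f.length() / 1024 + " KiB)");
>         }
>         // A timestamp range covered by more than one SSTable is a suspicious "duplicate".
>         for (Map.Entry<String, List<String>> e : byRange.entrySet())
>             if (e.getValue().size() > 1)
>                 System.out.println(e.getKey() + " -> " + e.getValue());
>     }
> }
> {code}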
> Here's what I suspect: DTCS is the only compaction strategy that would
> commonly finish compacting a really large SSTable and then, on the very next
> run, nominate the result for yet another compaction, even together with tiny
> SSTables, which certainly happens in our scenario. Potentially, the large
> SSTable that participated in the first compaction might even get nominated
> again by DTCS, if for some reason it can still be returned by
> getUncompactingSSTables.
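> To make that suspicion concrete, here is a toy model of the arithmetic (pure
> illustration, not Cassandra code; all names and sizes are invented): if each
> round compacts the latest big SSTable with a freshly streamed tiny one, and
> the inputs of each compaction are never deleted from disk, near-identical
> copies of the big SSTable pile up and disk usage multiplies even though the
> live data barely grows.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Toy illustration of the suspected failure mode (not Cassandra code).
> public class DeferredDeletionModel
> {
>     static class SSTable
>     {
>         final int generation;
>         final long bytes;
>         SSTable(int generation, long bytes) { this.generation = generation; this.bytes = bytes; }
>         public String toString()
>         {
>             return bytes >= (1L << 30)
>                  ? "gen " + generation + " (" + (bytes >> 30) + " GiB)"
>                  : "gen " + generation + " (" + (bytes >> 10) + " KiB)";
>         }
>     }
>
>     public static void main(String[] args)
>     {
>         List<SSTable> onDisk = new ArrayList<SSTable>(); // files present on disk
>         int nextGen = 1;
>         SSTable big = new SSTable(nextGen++, 10L << 30); // one 10 GiB SSTable
>         onDisk.add(big);
>
>         for (int round = 1; round <= 3; round++)
>         {
>             SSTable tiny = new SSTable(nextGen++, 64L << 10); // streamed-in 64 KiB SSTable
>             onDisk.add(tiny);
>
>             // Compaction: big + tiny -> a result only a few KiB larger than big.
>             SSTable result = new SSTable(nextGen++, big.bytes + tiny.bytes);
>             onDisk.add(result);
>
>             // The suspected bug: the inputs should now be deleted, but deletion
>             // is deferred and never happens until restart, so they stay on disk.
>             big = result; // the result is nominated again on the very next round
>         }
>
>         long total = 0;
>         for (SSTable s : onDisk)
>             total += s.bytes;
>         System.out.println("SSTables on disk: " + onDisk);
>         System.out.println("Disk usage: " + (total >> 30) + " GiB for ~10 GiB of live data");
>     }
> }
> {code}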
> Whatever the reason, I have collected evidence showing that these large
> "duplicate" SSTables are of the same "lineage". Only one should remain on
> disk: the latest one. The older ones have already been compacted, resulting
> in the newer ones. But for some reason, they never got deleted from disk. And
> this was really harmful when combining DTCS with continuously streaming in
> tiny SSTables. The same thing, only worse, would happen without the patches
> and with an uncapped max_sstable_age_days.
> Attached is one occurrence of 3 duplicated SSTables, their metadata and log
> lines about their compactions. You can see how similar they were to each
> other. SSTable generations 374277, 374249, 373702 (the large one), 374305,
> 374231 and 374333 completed compaction at 04:05:26,878, yet they were all
> still on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered
> another compaction with 375174. Both of those also stayed around after that
> compaction finished. Literally all SSTables named in these log lines were
> still on disk when I checked! Only one should have remained: 375189.
> Now this was just one random example from the data I collected. This happened
> everywhere. Some SSTables should probably have been deleted a day before.
> However, once we restarted the nodes, all of the duplicates were suddenly
> gone!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)