Björn Hegerfors created CASSANDRA-10510:
-------------------------------------------

             Summary: Compacted SSTables failing to get removed, overflowing disk
                 Key: CASSANDRA-10510
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10510
             Project: Cassandra
          Issue Type: Bug
            Reporter: Björn Hegerfors
         Attachments: nonReleasedSSTables.txt

Short version: it appears that if the resulting SSTable of a compaction enters 
another compaction soon afterwards, the SSTables that participated in the former 
compaction are not deleted from disk until Cassandra is restarted.

We ran into a big problem after applying CASSANDRA-10276 and CASSANDRA-10280, 
backported to 2.0.14. But the bug we're seeing was not introduced by these 
patches; they have merely made it very apparent and harmful.

Here's what happened. We had repair running on a table that holds a time series 
and uses DTCS. The ring was split into 5016 small ranges that were repaired one 
after the other (using parallel repair, i.e. not snapshot repair). This causes a 
flood of tiny SSTables to be streamed into all nodes (we don't use vnodes), with 
timestamp ranges similar to those of existing SSTables on disk. The problem is 
the sheer number of SSTables; disk usage itself is not affected. This has been 
reported before, see CASSANDRA-9644. These SSTables were streamed continuously 
for up to a couple of days.
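For context, this is roughly how we computed the subranges (one subrange repair 
was then run per range). A minimal, self-contained sketch; the class is 
hypothetical and not Cassandra code:

{code:java}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not Cassandra code) that splits the full Murmur3 token
 * space into N contiguous subranges, one repair invocation per range.
 * BigInteger avoids overflow on the 2^64-wide token span.
 */
public class SubrangeSplitter
{
    public static List<long[]> split(int numRanges)
    {
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        List<long[]> ranges = new ArrayList<long[]>();
        BigInteger start = min;
        for (int i = 1; i <= numRanges; i++)
        {
            // The i-th boundary is min + span * i / numRanges; the last one
            // lands exactly on Long.MAX_VALUE.
            BigInteger end = min.add(span.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(numRanges)));
            ranges.add(new long[]{ start.longValue(), end.longValue() });
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args)
    {
        for (long[] r : split(5016).subList(0, 3))
            System.out.printf("repair range (%d, %d]%n", r[0], r[1]);
    }
}
{code}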

The patches were applied to fix the problem of ending up with tens of thousands 
of SSTables that would never get touched by DTCS. But now that DTCS does touch 
them, we have run into a new problem instead. While disk usage was in the 
25-30% neighborhood before repairs began, it started growing fast once these 
continuous streams came in. Eventually, a couple of nodes ran out of disk, 
which led us to stop all repairs on the cluster.

This didn't reduce the disk usage, even though compactions were of course very 
active. More than doubling disk usage should not be possible regardless of the 
choices your compaction strategy makes: at worst, a compaction temporarily 
holds both its inputs and its output on disk at the same time. And we were not 
getting orders of magnitude more data streamed in. Large numbers of SSTables, 
yes, but tiny ones; the nodes were effectively creating data out of thin air.

We have a tool that shows timestamp and size metadata for SSTables (a rough 
sketch of it follows below). Looking at all non-tmp data files, we found 
something akin to duplicates of almost all of the largest SSTables. Not quite 
exact replicas, but multi-gigabyte SSTables covering exactly the same range of 
timestamps and differing in size by mere kilobytes. There were typically 3 
copies of each of the largest SSTables, sometimes even more.
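For reference, here is a minimal sketch of what that tool does. It assumes the 
stock sstablemetadata script is on PATH and prints "Minimum timestamp:" / 
"Maximum timestamp:" lines, as it does in our 2.0 tools directory; the class 
itself is hypothetical:

{code:java}
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

/**
 * Sketch of our metadata tool: lists every non-tmp Data.db file in a table
 * directory with its size and timestamp range, scraped from sstablemetadata
 * output. "Duplicates" show up as rows with near-identical sizes and identical
 * timestamp ranges.
 */
public class SSTableReport
{
    public static void main(String[] args) throws Exception
    {
        for (File f : new File(args[0]).listFiles())
        {
            String name = f.getName();
            if (!name.endsWith("-Data.db") || name.contains("-tmp-"))
                continue;
            long min = -1, max = -1;
            Process p = new ProcessBuilder("sstablemetadata", f.getPath()).start();
            BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
            for (String line; (line = r.readLine()) != null; )
            {
                if (line.startsWith("Minimum timestamp:"))
                    min = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
                else if (line.startsWith("Maximum timestamp:"))
                    max = Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            }
            p.waitFor();
            System.out.printf("%s  %,d bytes  timestamps [%d, %d]%n", name, f.length(), min, max);
        }
    }
}
{code}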

Here's what I suspect: DTCS is the only compaction strategy that would commonly 
finish compacting a really large SSTable and then, on the very next run of the 
strategy, nominate the result for yet another compaction, even if only together 
with tiny SSTables, which certainly happens in our scenario. Potentially, the 
large SSTable that participated in the first compaction might even get 
nominated again by DTCS, if for some reason it can still be returned by 
getUncompactingSSTables.
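To illustrate the suspicion, here is a deliberately simplified model of DTCS's 
windowing, not the actual DateTieredCompactionStrategy code (class and field 
names are made up): SSTables whose min timestamps fall in the same window are 
nominated together, so a freshly compacted large table keeps getting bucketed 
with the tiny repair-streamed tables that keep arriving in the same old window.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified model of DTCS bucketing, for illustration only. */
public class DtcsModel
{
    static class Table
    {
        final String name;
        final long minTimestamp;
        Table(String name, long minTimestamp) { this.name = name; this.minTimestamp = minTimestamp; }
        public String toString() { return name; }
    }

    /** Group tables whose min timestamps fall into the same fixed window. */
    static List<List<Table>> buckets(List<Table> tables, long windowSize)
    {
        Map<Long, List<Table>> byWindow = new TreeMap<Long, List<Table>>();
        for (Table t : tables)
        {
            Long window = t.minTimestamp / windowSize;
            if (!byWindow.containsKey(window))
                byWindow.put(window, new ArrayList<Table>());
            byWindow.get(window).add(t);
        }
        return new ArrayList<List<Table>>(byWindow.values());
    }

    public static void main(String[] args)
    {
        List<Table> tables = Arrays.asList(
            new Table("374373 (large, just compacted)", 1000),
            new Table("streamed-tiny-1", 1010),
            new Table("streamed-tiny-2", 1020));
        for (List<Table> bucket : buckets(tables, 10000))
            if (bucket.size() > 1)
                // Nominated again immediately, dragging the fresh output into
                // yet another compaction alongside the tiny tables.
                System.out.println("compaction candidates: " + bucket);
    }
}
{code}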

Whatever the reason, I have collected evidence showing that these large 
"duplicate" SSTables are of the same "lineage". Only one of them should remain 
on disk: the latest one. The older ones had already been compacted, producing 
the newer ones, but for some reason they never got deleted from disk. This was 
really harmful when combining DTCS with a continuous stream of tiny incoming 
SSTables. The same thing, only worse, would happen without the patches and with 
an uncapped max_sstable_age_days.
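My mental model of the deletion path, again as a simplified sketch rather than 
Cassandra's actual code (the method names below are illustrative): a compacted 
input is only unlinked once it is marked obsolete AND its reference count drops 
to zero, so anything that acquires a reference and never releases it leaves the 
file on disk indefinitely.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of reference-counted SSTable deletion, illustration only. */
public class RefCountedSSTable
{
    private final AtomicInteger refs = new AtomicInteger(1); // the "live set" reference
    private volatile boolean obsolete = false;
    private final String path;

    RefCountedSSTable(String path) { this.path = path; }

    void acquire() { refs.incrementAndGet(); }           // e.g. a compaction takes a reference
    void markObsolete() { obsolete = true; release(); }  // compaction done: drop the live reference

    void release()
    {
        if (refs.decrementAndGet() == 0 && obsolete)
            System.out.println("deleting " + path);      // real code would unlink the file here
    }

    public static void main(String[] args)
    {
        RefCountedSSTable ok = new RefCountedSSTable("...-374249-Data.db");
        ok.markObsolete();    // no extra references: "deleting" prints immediately

        RefCountedSSTable stuck = new RefCountedSSTable("...-373702-Data.db");
        stuck.acquire();      // suppose a second nomination grabbed it concurrently
        stuck.markObsolete(); // first compaction finishes: count is still 1, no deletion
        // The extra reference is never released, so this multi-GB file would
        // stay on disk until the node is restarted.
    }
}
{code}

If something along these lines is happening, a reference that is never released 
would explain both the lingering files and why they disappeared after a restart.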

Attached is one occurrence of 3 duplicated SSTables, with their metadata and the 
log lines about their compactions. You can see how similar they were to each 
other. SSTable generations 374277, 374249, 373702 (the large one), 374305, 
374231 and 374333 finished compacting at 04:05:26,878, yet they were all still 
on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered another 
compaction together with 375174. Those two also stayed around after that 
compaction finished. Literally every SSTable named in these log lines was still 
on disk when I checked! Only one should have remained: 375189.

Now, this was just one random example from the data I collected; this happened 
everywhere. Some SSTables should probably have been deleted a day earlier.

However, once we restarted the nodes, all of the duplicates were suddenly gone!


