It is actually more complicated than just removing the sstable and running 
repair.

In the face of expired tombstones that might be covering data in other sstables, 
the only safe way to deal with a bad sstable is to wipe the token range covered 
by the bad sstable and rebuild/bootstrap that range (or wipe and rebuild the 
whole node, which is usually the easier way).  If expired tombstones are in 
play, they may already have been compacted away on the other replicas but not 
yet on the current replica, meaning the data they cover could still be present 
in other sstables on this node.  Removing the sstable would resurrect that 
data.  And pulling the range from the other nodes does not help, because they 
may have already compacted away the tombstone, so you won't get it back.
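To make that concrete, here is a toy simulation under a deliberately simplified model of sstables, timestamp-merge reads, and tombstone GC. None of these names are Cassandra internals; it only illustrates the resurrection mechanism:

```python
# Toy model of why deleting a corrupted sstable can resurrect data once
# tombstones have passed gc_grace_seconds. Purely illustrative - these
# are not Cassandra APIs.

TOMBSTONE = object()

def read(sstables, key):
    """Merge on read: the cell with the newest timestamp wins."""
    newest = None  # (timestamp, value)
    for table in sstables:
        if key in table:
            ts, value = table[key]
            if newest is None or ts > newest[0]:
                newest = (ts, value)
    if newest is None or newest[1] is TOMBSTONE:
        return None  # absent or deleted
    return newest[1]

def compact(sstables):
    """Merge all sstables; gc_grace has elapsed, so tombstones are purged."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return [{k: v for k, v in merged.items() if v[1] is not TOMBSTONE}]

# Healthy replica: compaction merged the data with its tombstone and,
# with gc_grace expired, purged both - it now holds nothing for "k".
healthy = compact([{"k": (1, "old-value")}, {"k": (2, TOMBSTONE)}])

# Bad replica: never compacted. The tombstone sits in the sstable that
# later becomes corrupted; the shadowed data sits in another sstable.
data_sstable = {"k": (1, "old-value")}
corrupt_sstable = {"k": (2, TOMBSTONE)}

print(read([data_sstable, corrupt_sstable], "k"))  # None - still deleted
print(read(healthy, "k"))                          # None - deleted here too

# "Fix" the corruption by deleting the bad sstable, then "repair" from
# the healthy replica - which has neither the data nor the tombstone.
remaining = [data_sstable]
print(read(remaining, "k"))  # "old-value" - the deleted data is back
```

The last read shows the problem: once the tombstone is gone from every replica, nothing in the cluster can re-delete the shadowed cell.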

TL;DR: you can't just remove the one sstable; you have to remove all data in 
the token range covered by the sstable (i.e. all data that sstable may have 
had a tombstone covering).  Then you can stream from the other nodes to get 
the data back.
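For reference, the "mark the corrupted range and re-stream it" idea from the quoted mail below could be sketched roughly like this (a hypothetical model, not Cassandra code). The catch described above is the fallback step: streaming from replicas cannot restore a tombstone they have already purged.

```python
# Hypothetical sketch of a "bad block" mechanism: remember which token
# ranges of an sstable are corrupted, refuse local reads in them, and
# fall back to a healthy replica. Not an actual Cassandra interface.

from bisect import insort

class SSTableWithBadRanges:
    def __init__(self, rows):
        self.rows = rows      # token -> value
        self.bad_ranges = []  # sorted list of (start, end), inclusive

    def mark_corrupted(self, start, end):
        insort(self.bad_ranges, (start, end))

    def get(self, token):
        for start, end in self.bad_ranges:
            if start <= token <= end:
                raise LookupError("token in corrupted range")
        return self.rows.get(token)

def read_with_recovery(sstable, replicas, token):
    """Serve locally unless the token falls in a bad range; then fall
    back to (i.e. stream from) a healthy replica."""
    try:
        return sstable.get(token)
    except LookupError:
        for replica in replicas:
            value = replica.get(token)
            if value is not None:
                return value
        return None  # replicas have nothing - e.g. a purged tombstone

local = SSTableWithBadRanges({10: "a", 20: "b", 30: "c"})
healthy = {20: "b"}
local.mark_corrupted(15, 25)  # a read error was detected around token 20

print(read_with_recovery(local, [healthy], 10))  # "a" - unaffected range
print(read_with_recovery(local, [healthy], 20))  # "b" - served via replica
```

Note that if the corrupted range held an expired tombstone rather than live data, the replica fallback returns nothing to re-suppress the shadowed cells, which is exactly the resurrection hazard.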

-Jeremiah

> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev <dev@cassandra.apache.org> 
> wrote:
> 
> At the moment, when a read error, such as unrecoverable bit error or data 
> corruption, occurs in the SSTable data files, regardless of the 
> disk_failure_policy configuration, manual (or to be precise, external) 
> intervention is required to recover from the error.
> 
> Commonly, there are two approaches to recover from such an error:
> 
> 1. The safer but slower recovery strategy: replace the entire node.
> 2. The less safe but faster recovery strategy: shut down the node, delete the 
> affected SSTable file(s), then bring the node back online and run repair.
> Based on my understanding of Cassandra, it should be possible to recover from 
> such an error by marking the affected token ranges in the existing SSTable as 
> "corrupted" and no longer reading from them (e.g. recording them in a "bad 
> block" file or in memory), and then streaming the affected token ranges from 
> the healthy replicas. The corrupted SSTable file can then be removed upon the 
> next successful compaction involving it, or alternatively an anti-compaction 
> can be performed on it to remove the corrupted data.
> 
> The advantages of this strategy are:
> 
> - Reduced node downtime - no node restart or replacement is needed
> - Less data streaming is required - only the affected token ranges
> - Faster recovery - less streaming, and compaction or anti-compaction can be 
> deferred
> - No less safe than replacing the entire node
> - The process can be automated internally, removing the need for operator 
> input
> The disadvantages are added complexity on the SSTable read path, and the risk 
> of masking disk failures from an operator who is not paying attention.
> 
> What do you think about this?
> 
