Adding a poison-pill option for when corrupt data is found makes sense to me. 
I'm not sure there's enough demand, or enough other customization being done in 
this space, to justify the user-customizable aspect; do any other approaches 
come immediately to mind? If not, this isn't an area of the code that has changed 
all that much, so just adding a new option seems surgical and minimal to me.

On Tue, Dec 12, 2023, at 4:21 AM, Claude Warren, Jr via dev wrote:
> I can see this as a strong improvement in Cassandra management and support 
> it. 
> 
> +1 (non-binding)
> 
> On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <raymondmhuff...@gmail.com> 
> wrote:
>> Hello All,
>> 
>> On our fork of Cassandra, we've implemented some custom behavior for 
>> handling CommitLog and SSTable Corruption errors. Specifically, if a node 
>> detects one of those errors, we want the node to stop itself, and if the 
>> node is restarted, we want initialization to fail. This is useful in 
>> Kubernetes when you expect nodes to be restarted frequently and makes our 
>> corruption remediation workflows less error-prone. I think we could make 
>> this behavior more pluggable by allowing users to provide, via config, 
>> custom implementations of the FSErrorHandler and of the error handler 
>> that's currently implemented at 
>> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError, in 
>> the same way one can provide custom Partitioners and 
>> Authenticators/Authorizers.
>> 
>> Would you take as a contribution one of the following?
>> 1. user provided implementations of FSErrorHandler and 
>> CommitLogErrorHandler, set via config; and/or
>> 2. new commit failure and disk failure policies that write a poison pill 
>> file to disk and fail on startup if that file exists
>> 
>> The poison pill implementation is what we currently use - we call this a 
>> "Non Transient Error" and we want these states to always require manual 
>> intervention to resolve, including manual action to clear the error. I'd be 
>> happy to contribute this if other users would find it beneficial. I had 
>> initially shared this question in Slack, but I'm now sharing it here for 
>> broader visibility.
>> 
>> -Raymond Huffman
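
For concreteness, here is a minimal sketch of the poison-pill idea described in option 2 above: write a marker file when corruption is detected, and refuse to start while it exists. This is not the code from Raymond's fork and not an existing Cassandra API; the class name PoisonPill, the marker file name, and both method names are hypothetical, and a real patch would presumably hook into the existing disk and commit failure policy handling instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; none of these names exist in Cassandra.
final class PoisonPill {
    // Assumed marker file name, chosen only for illustration.
    static final String MARKER_NAME = "NON_TRANSIENT_ERROR";

    private PoisonPill() {}

    // Would be called from the corruption-handling path before the node stops.
    // Appends the reason and syncs to disk so the marker survives a restart.
    public static Path write(Path dataDir, String reason) throws IOException {
        Path marker = dataDir.resolve(MARKER_NAME);
        byte[] line = (reason + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(marker, line,
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
        return marker;
    }

    // Would be called early in startup: fail while the marker is present,
    // so that clearing the error always requires a manual step.
    public static void checkOnStartup(Path dataDir) {
        Path marker = dataDir.resolve(MARKER_NAME);
        if (Files.exists(marker)) {
            throw new IllegalStateException(
                    "Poison pill present at " + marker + "; manual intervention required");
        }
    }
}
```

In this sketch the operator clears the condition by deleting the marker file by hand, which matches the "always require manual intervention" requirement; where the file should live and how it interacts with multiple data directories are open questions.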
