I see this as a strong improvement to Cassandra management, and I support it.
+1 non binding

On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <raymondmhuff...@gmail.com> wrote:

> Hello All,
>
> On our fork of Cassandra, we've implemented some custom behavior for
> handling CommitLog and SSTable Corruption errors. Specifically, if a node
> detects one of those errors, we want the node to stop itself, and if the
> node is restarted, we want initialization to fail. This is useful in
> Kubernetes when you expect nodes to be restarted frequently and makes our
> corruption remediation workflows less error-prone. I think we could make
> this behavior more pluggable by allowing users to provide custom
> implementations of the FSErrorHandler, and the error handler that's
> currently implemented at
> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError via config in
> the same way one can provide custom Partitioners and
> Authenticators/Authorizers.
>
> Would you take as a contribution one of the following?
> 1. user provided implementations of FSErrorHandler and
> CommitLogErrorHandler, set via config; and/or
> 2. new commit failure and disk failure policies that write a poison pill
> file to disk and fail on startup if that file exists
>
> The poison pill implementation is what we currently use - we call this a
> "Non Transient Error" and we want these states to always require manual
> intervention to resolve, including manual action to clear the error. I'd be
> happy to contribute this if other users would find it beneficial. I had
> initially shared this question in Slack, but I'm now sharing it here for
> broader visibility.
>
> -Raymond Huffman
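
For anyone who wants a concrete picture of option 2 in the quoted proposal, here is a minimal sketch of the poison pill idea in plain Java. Everything in it is an assumption made for illustration: the class name PoisonPill, the marker file name, and the method names are hypothetical and are not taken from Cassandra or from Raymond's fork. It only shows the three steps described above: write a marker when corruption is detected, refuse to start while the marker exists, and require an explicit manual action to clear it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not part of Apache Cassandra. */
final class PoisonPill {
    // Hypothetical marker file name, chosen for this sketch.
    static final String FILE_NAME = "non_transient_error.marker";

    private PoisonPill() {}

    /** Would be called by a failure policy when corruption is detected, before the node stops. */
    static void write(Path dataDir, String reason) {
        try {
            Files.createDirectories(dataDir);
            Files.write(dataDir.resolve(FILE_NAME), reason.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if a previous run left a marker behind. */
    static boolean exists(Path dataDir) {
        return Files.exists(dataDir.resolve(FILE_NAME));
    }

    /** Startup check: fail initialization while the marker is present. */
    static void checkOnStartup(Path dataDir) {
        if (exists(dataDir)) {
            throw new IllegalStateException(
                "Non-transient error marker found in " + dataDir + "; manual intervention required");
        }
    }

    /** Manual operator action: remove the marker once the corruption has been remediated. */
    static void clear(Path dataDir) {
        try {
            Files.deleteIfExists(dataDir.resolve(FILE_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a restart loop in Kubernetes would keep failing at checkOnStartup until an operator calls clear (or deletes the file by hand), which matches the "always require manual intervention" behavior described in the quoted message. How this would hook into the existing disk and commit failure policies is left open here.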