Adding a poison-pill error option for when corrupt data is found makes sense to me. I'm not sure there's enough demand, or enough other customization being done in this space, to justify the user-customizable aspect; do any other approaches immediately come to mind? If not, this isn't an area of the code that has changed much, so just adding a new option seems surgical and minimal to me.
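To make the poison-pill idea concrete, here is a rough sketch in Java of the behavior as I understand it from the proposal: write a marker file when corruption is detected, and refuse to start while that file exists. The class name PoisonPill, the marker file name, and the directory argument are all hypothetical, and none of this is existing Cassandra code; how it would hook into the disk and commit failure policies is left open.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not part of Cassandra. */
final class PoisonPill {
    // Hypothetical marker file name, chosen for this sketch.
    static final String FILE_NAME = "non_transient_error.marker";

    private PoisonPill() {}

    /** On detecting corruption: persist the reason so later restarts keep failing. */
    static void write(Path dir, String reason) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(FILE_NAME), reason.getBytes(StandardCharsets.UTF_8));
    }

    /** True if a marker from an earlier failure is still on disk. */
    static boolean exists(Path dir) {
        return Files.exists(dir.resolve(FILE_NAME));
    }

    /** Early in startup: refuse to initialize until an operator removes the marker. */
    static void failIfPresent(Path dir) {
        if (exists(dir)) {
            throw new IllegalStateException(
                "Poison pill found at " + dir.resolve(FILE_NAME)
                    + "; manual intervention is required before restart");
        }
    }
}
```

In this sketch the error path would call PoisonPill.write(...) before stopping the node, and startup would call PoisonPill.failIfPresent(...) before any other initialization, so clearing the error always requires someone to delete the file by hand.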
On Tue, Dec 12, 2023, at 4:21 AM, Claude Warren, Jr via dev wrote:
> I can see this as a strong improvement in Cassandra management and support
> it.
>
> +1 non binding
>
> On Mon, Dec 11, 2023 at 8:28 PM Raymond Huffman <raymondmhuff...@gmail.com>
> wrote:
>> Hello All,
>>
>> On our fork of Cassandra, we've implemented some custom behavior for
>> handling CommitLog and SSTable Corruption errors. Specifically, if a node
>> detects one of those errors, we want the node to stop itself, and if the
>> node is restarted, we want initialization to fail. This is useful in
>> Kubernetes when you expect nodes to be restarted frequently and makes our
>> corruption remediation workflows less error-prone. I think we could make
>> this behavior more pluggable by allowing users to provide custom
>> implementations of the FSErrorHandler, and the error handler that's
>> currently implemented at
>> org.apache.cassandra.db.commitlog.CommitLog#handleCommitError via config in
>> the same way one can provide custom Partitioners and
>> Authenticators/Authorizers.
>>
>> Would you take as a contribution one of the following?
>> 1. user provided implementations of FSErrorHandler and
>> CommitLogErrorHandler, set via config; and/or
>> 2. new commit failure and disk failure policies that write a poison pill
>> file to disk and fail on startup if that file exists
>>
>> The poison pill implementation is what we currently use - we call this a
>> "Non Transient Error" and we want these states to always require manual
>> intervention to resolve, including manual action to clear the error. I'd be
>> happy to contribute this if other users would find it beneficial. I had
>> initially shared this question in Slack, but I'm now sharing it here for
>> broader visibility.
>>
>> -Raymond Huffman