Replies below (to Scott, Josh and Jeremiah). tl;dr: all four of my points remain
undisputed when the patch is applied. This is a messy situation, but there's no
denying the value of rejecting writes in various known, common scenarios. Point
(1) remains important to highlight IMHO.
On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net> wrote:

> Since that time, I’ve received several urgent messages from major users of
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking
> about this bug. Some were able to verify the presence of lost data in
> SSTables on nodes where it didn’t belong, demonstrate empty read responses
> for data that is known proof-positive to exist (think content-addressable
> stores), or reproduce this behavior in a local cluster after forcing
> disagreement.

Having been privy to the background of those "urgent messages", I can say the
information you received wasn't correct (or complete). My challenge on this
thread is about understanding where this might unexpectedly bite users, which
should be part of our due diligence when applying such patches to stable
branches. I ask you to run through my four points, which AFAIK still stand
true.

> But I *don't* think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it's been around, or how
> frequently some of us on the project have observed some given phenomenon.

Josh, that's true, but talking through these things helps open up the
discussion; see my point above wrt the second-hand evidence being inaccurate.

> The severity and frequency of this issue combined with the business risk to
> Apache Cassandra users changed my mind about fixing it in earlier branches
> despite TCM having been merged to fix it for good on trunk.

That shouldn't prevent us from investigating known edge cases, collateral
damage, and unexpected behavioural changes in patch versions.

> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>
>> 1. Rejecting writes does not prevent data loss in this situation. It only
>> reduces it. The investigation and remediation of possible mislocated data
>> is still required.
>
> All nodes which reject a write prevent mislocated data. There is still
> the possibility of some node having the same wrong view of the ring as the
> coordinator (including if they are the same node) accepting data. Unless
> there are multiple nodes with the same wrong view of the ring, data loss is
> prevented for CL > ONE.

(1) stands true, for all CLs. I think this is pretty important here. With
write rejection enabled, we can tell people it may have prevented a lot of
data mislocation and is of great benefit and safety, but there is no guarantee
that it has prevented all data mislocation. If an operator encounters writes
rejected in this manner, they must still investigate a possible data loss
situation. We are aware of our own situations where we have been hit by this,
and they come in a number of variants, but we can't speak to every situation
users will find themselves in. We're making a trade-off here of reduced
availability against more forceful alerting and an alleviation of data
mislocation. (A small sketch after the point (2) reply below makes this
concrete.)

>> 2. Rejecting writes is a louder form of alerting for users unaware of the
>> scenario, those not already monitoring logs or metrics.
>
> Without this patch no one is aware of any issues at all. Maybe you are
> referring to a situation where the patch is applied, but the default
> behavior is to still accept the “bad” data? In that case yes, turning on
> rejection makes it “louder” in that your queries can fail if too many nodes
> are wrong.

(2) stands true. Rejecting is a louder alert, but it is not complete; see the
next point. (All four points are made with the patch applied.)
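To make (1) and (2) concrete, here is a minimal sketch. This is plain Python,
not Cassandra code; the ring layout and node names are made up. The only point
it illustrates is that rejection is decided per node against that node's own
(possibly stale) view of the ring, so a coordinator and a replica sharing the
same stale view will still place and ack mislocated data.

    # Minimal illustrative sketch -- NOT Cassandra code. Hypothetical
    # single-owner ring views; rejection is evaluated against each node's
    # OWN view only.
    correct_ring = {"n1": range(0, 100), "n2": range(100, 200), "n3": range(200, 300)}
    stale_ring   = {"n1": range(0, 150), "n2": range(150, 200), "n3": range(200, 300)}

    def owner(ring, token):
        return next(n for n, r in ring.items() if token in r)

    def accepts(node, node_view, token, reject_enabled=True):
        # A node only rejects if, per its OWN view, it does not own the token.
        return (not reject_enabled) or owner(node_view, token) == node

    token = 120                        # really owned by n2 (correct_ring)
    target = owner(stale_ring, token)  # a coordinator holding the stale view picks n1

    print(accepts(target, stale_ring, token))    # True  -> mislocated write is acked
    print(accepts(target, correct_ring, token))  # False -> rejected only once n1's
                                                 #          view has caught up

Whether the bad write is stopped therefore depends entirely on which nodes
happen to share the stale view, which is why an operator who sees rejections
must still treat mislocation as possible.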
>> 3. Rejecting writes does not capture all places where the problem is
>> occurring. Only logging/metrics fully captures everywhere the problem is
>> occurring.
>
> Not sure what you are saying here.

Rejected writes can be swallowed by a coordinator sending background writes to
other nodes when it has already ack'd the response to the client. If the
operator wants a complete and accurate overview of out-of-range writes, they
have to look at the logs/metrics.

(3) stands true.

>> 4. … nodes can be rejecting writes when they are in fact correct hence
>> causing “over-eager unavailability”.
>
> When would this occur? I guess when the node with the bad ring
> information is a replica sent data from a coordinator with the correct ring
> state? There would be no “unavailability” here unless there were multiple
> nodes in such a state. I also again would not call this over eager,
> because the node with the bad ring state is f’ed up and needs to be fixed.
> So it being considered unavailable doesn’t seem over-eager to me.

This can fail a quorum write. And the node need not be f'ed up, just delayed
in its view. (The worked sketch in the P.S. below illustrates the failure
mode.)

(4) stands true.

> Given the fact that a user can read NEWS.txt and turn off this rejection of
> writes, I see no reason not to err on the side of “the setting which gives
> better protection even if it is not perfect”. We should not let the want
> to solve everything prevent incremental improvements, especially when we
> actually do have the solution coming in TCM.

Yes, this is what I'm aiming at: being truthful that it's a best effort at
alerting on and alleviating some very serious scenarios. It can prevent data
mislocation in some scenarios, but it offers no guarantee of that, and it can
also degrade availability unnecessarily. Through production experience from a
number of large cluster operators we know it significantly and importantly
improves the consistency and durability of data, but ultimately an operator
finding themselves in this situation must still assume possible eventual data
loss and investigate accordingly. Due diligence and accuracy.
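P.S. A minimal sketch of the point (4) scenario, again plain Python with
made-up names rather than Cassandra code: RF=3, CL=QUORUM, the coordinator's
view is correct, but one replica is briefly behind on ring state and rejects a
write it does in fact own; if a second replica happens to be down or slow, the
write fails even though nothing would have been misplaced.

    # Minimal illustrative sketch -- NOT Cassandra code. RF=3, QUORUM needs 2 acks.
    replicas = ["n1", "n2", "n3"]      # correct owners of the partition
    quorum = len(replicas) // 2 + 1    # 2

    def respond(view_is_current, is_up, reject_enabled=True):
        if not is_up:
            return None                # no response (down or too slow)
        if reject_enabled and not view_is_current:
            return "REJECT"            # node is behind on ring state
        return "ACK"

    responses = [
        respond(view_is_current=True,  is_up=True),   # n1: ACK
        respond(view_is_current=False, is_up=True),   # n2: REJECT (delayed view)
        respond(view_is_current=True,  is_up=False),  # n3: no response
    ]

    acks = sum(r == "ACK" for r in responses)
    print("write succeeds" if acks >= quorum else
          "write fails: over-eager unavailability")

With rejection disabled, n2 would have accepted and the write would have met
quorum; that availability cost is the trade-off being made for the protection.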