I agree with the folks saying that we absolutely need to reject misplaced writes. It may not preclude a coordinator making a local write, or a write to a local replica, but even reducing the probability of a misplaced write being reported as a success to the client is a substantial win.
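To make that trade-off concrete, here is a toy model (plain Python, not Cassandra code; the node names, token ranges, and helper functions are made up purely for illustration) of why replica-side rejection catches a misplaced write forwarded to a healthy replica, but not one where the coordinator's own stale view tells it to write locally:

    # Toy model, not Cassandra code: node names, token ranges and helpers are
    # hypothetical, only to illustrate the coordinator-local-write gap.

    # What ownership should be, versus n1's stale view of the ring.
    authoritative_ring = {"n1": range(0, 100), "n2": range(100, 200), "n3": range(200, 300)}
    stale_ring         = {"n1": range(0, 150), "n2": range(150, 250), "n3": range(250, 300)}
    views = {"n1": stale_ring, "n2": authoritative_ring, "n3": authoritative_ring}

    def owner(ring, token):
        return next(node for node, rng in ring.items() if token in rng)

    def write(coordinator, token):
        # The coordinator picks the replica from its own view; with rejection
        # enabled, the replica accepts only if *its* view says it owns the token.
        target = owner(views[coordinator], token)
        accepted = owner(views[target], token) == target
        misplaced = owner(authoritative_ring, token) != target
        return target, accepted, misplaced

    print(write("n2", 120))  # ('n2', True, False): correctly placed and accepted
    print(write("n1", 220))  # ('n2', False, True): stale coordinator forwards,
                             # healthy replica rejects the misplaced write
    print(write("n1", 120))  # ('n1', True, True): stale coordinator writes locally,
                             # its own check passes, and the misplaced write lands

The second case is where rejection pays off; the third is the residual risk being discussed below.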
On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:

> Replies below (to Scott, Josh and Jeremiah).
>
> tl;dr all four of my points remain undisputed when the patch is applied. This is a messy situation, but there's no denying the value of rejecting writes in various known, popular scenarios. Point (1) remains important to highlight IMHO.
>
> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net> wrote:
>
>> Since that time, I’ve received several urgent messages from major users of Apache Cassandra and even customers of Cassandra ecosystem vendors asking about this bug. Some were able to verify the presence of lost data in SSTables on nodes where it didn’t belong, demonstrate empty read responses for data that is known proof-positive to exist (think content-addressable stores), or reproduce this behavior in a local cluster after forcing disagreement.
>
> Having been privy to the background of those "urgent messages", I can say the information you received wasn't correct (or complete).
>
> My challenge on this thread is about understanding where this might unexpectedly bite users, which should be part of our due diligence when applying such patches to stable branches. I ask you to run through my four points, which AFAIK still stand true.
>
>> But I **don't** think it's healthy for us to repeatedly re-litigate whether data loss is acceptable based on how long it's been around, or how frequently some of us on the project have observed some given phenomenon.
>
> Josh, that's true, but talking through these things helps open up the discussion; see my point above wrt being aware of second-hand evidence that was inaccurate.
>
>> The severity and frequency of this issue combined with the business risk to Apache Cassandra users changed my mind about fixing it in earlier branches despite TCM having been merged to fix it for good on trunk.
>
> That shouldn't prevent us from investigating known edge cases, collateral damage, and unexpected behavioural changes in patch versions.
>
>>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>>
>>>> 1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possible mislocated data is still required.
>>>
>>> All nodes which reject a write prevent mislocated data. There is still the possibility of some node having the same wrong view of the ring as the coordinator (including if they are the same node) accepting data. Unless there are multiple nodes with the same wrong view of the ring, data loss is prevented for CL > ONE.
>
> (1) stands true, for all CLs. I think this is pretty important here.
>
> With write rejection enabled, we can tell people it may have prevented a lot of data mislocation and is of great benefit and safety, but there's no guarantee that it has prevented all data mislocation. If an operator encounters writes rejected in this manner, they must still investigate a possible data loss situation.
>
> We are aware of our own situations where we have been hit by this, and they come in a number of variants, but we can't speak to every situation users will find themselves in. We're making a trade-off here of reduced availability against more forceful alerting and an alleviation of data mislocation.
>
>>>> 2. Rejecting writes is a louder form of alerting for users unaware of the scenario, those not already monitoring logs or metrics.
>>>
>>> Without this patch no one is aware of any issues at all. Maybe you are referring to a situation where the patch is applied, but the default behavior is to still accept the “bad” data? In that case yes, turning on rejection makes it “louder” in that your queries can fail if too many nodes are wrong.
>
> (2) stands true. Rejecting is a louder alert, but it is not complete; see the next point. (All four points are made with the patch applied.)
>
>>>> 3. Rejecting writes does not capture all places where the problem is occurring. Only logging/metrics fully captures everywhere the problem is occurring.
>>>
>>> Not sure what you are saying here.
>
> Rejected writes can be swallowed by a coordinator sending background writes to other nodes when it has already ack'd the response to the client. If the operator wants a complete and accurate overview of out-of-range writes, they have to look at the logs/metrics.
>
> (3) stands true.
>
>>>> 4. … nodes can be rejecting writes when they are in fact correct, hence causing “over-eager unavailability”.
>>>
>>> When would this occur? I guess when the node with the bad ring information is a replica being sent data from a coordinator with the correct ring state? There would be no “unavailability” here unless there were multiple nodes in such a state. I also again would not call this over-eager, because the node with the bad ring state is f’ed up and needs to be fixed. So it being considered unavailable doesn’t seem over-eager to me.
>
> This fails in a quorum write. And the node need not be f'ed up, just delayed in its view.
>
> (4) stands true.
>
>>> Given the fact that a user can read NEWS.txt and turn off this rejection of writes, I see no reason not to err on the side of “the setting which gives better protection even if it is not perfect”. We should not let the desire to solve everything prevent incremental improvements, especially when we actually do have the solution coming in TCM.
>
> Yes, this is what I'm aiming at: being truthful that this is a best-effort approach to alerting and to alleviating some very serious scenarios. It can prevent data mislocation in some scenarios, but it offers no guarantee of that, and it can also degrade availability unnecessarily. Through production experience from a number of large cluster operators we know it does significantly and importantly improve the consistency and durability of data, but ultimately an operator finding themselves in this situation must still assume possible eventual data loss and investigate accordingly. Due diligence and accuracy.
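For point (4), a rough sketch of the quorum arithmetic may help (hypothetical model, not Cassandra code; it assumes RF=3 and CL=QUORUM, and the function name is made up). A single replica rejecting a correctly placed write because its ring view is merely delayed is absorbed by the quorum, but combined with one slow or down replica it turns a write that would otherwise have succeeded into a failure:

    # Rough sketch of point (4), assuming RF=3 and CL=QUORUM (2 acks needed).
    RF = 3
    QUORUM = RF // 2 + 1  # 2

    def quorum_write(replica_responses):
        # Each response is "ack", "reject" (out-of-range check on a delayed view),
        # or "down" (replica unreachable or too slow).
        return "success" if replica_responses.count("ack") >= QUORUM else "write fails"

    print(quorum_write(["ack", "ack", "ack"]))     # success
    print(quorum_write(["ack", "ack", "reject"]))  # success: one wrong rejection is absorbed
    print(quorum_write(["ack", "down", "reject"])) # write fails: the rejection now costs
                                                   # availability despite correct placement

The sketch only shows where the availability cost comes from; whether that trade is worth it is exactly the judgement call being discussed above.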