I agree with the folks saying that we absolutely need to reject misplaced writes. 
It may not preclude the coordinator making a local write, or making a write to a 
local replica, but even reducing the probability of a misplaced write being shown 
as a success to the client is a substantial win. 
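
To make "misplaced write" concrete, here is a minimal sketch of the idea behind 
the replica-side check (illustrative Java only, not Cassandra's actual code; all 
names are hypothetical): before applying a mutation, the node compares the 
mutation's token against the ranges it believes it owns, and fails the write 
otherwise.

    // Hypothetical sketch of replica-side out-of-range write rejection.
    // Real tokens wrap around the ring; wraparound is ignored here for brevity.
    import java.util.List;

    class OutOfRangeRejectionSketch
    {
        // A (left, right] token range, as in consistent hashing.
        record Range(long left, long right)
        {
            boolean contains(long token) { return token > left && token <= right; }
        }

        static void applyMutation(List<Range> ownedRanges, long token, Runnable apply)
        {
            boolean owned = ownedRanges.stream().anyMatch(r -> r.contains(token));
            if (!owned)
                // Fail loudly instead of silently persisting data on a node
                // that is not a replica for this token.
                throw new IllegalStateException("rejecting out-of-range write for token " + token);
            apply.run();
        }
    }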

On Fri, Sep 13, 2024, at 10:41 AM, Mick Semb Wever wrote:
> replies below (to Scott, Josh and Jeremiah).
> 
> tl;dr all four of my points remain undisputed when the patch is applied.   This 
> is a messy situation, but there's no denying the value of rejecting writes in 
> various known and common scenarios.  Point (1) remains important to highlight IMHO.
> 
> 
> On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net> wrote:
>> Since that time, I’ve received several urgent messages from major users of 
>> Apache Cassandra and even customers of Cassandra ecosystem vendors asking 
>> about this bug. Some were able to verify the presence of lost data in 
>> SSTables on nodes where it didn’t belong, demonstrate empty read responses 
>> for data that is known proof-positive to exist (think content-addressable 
>> stores), or reproduce this behavior in a local cluster after forcing 
>> disagreement.
>> 
> 
> 
> 
> Having been privy to the background of those "urgent messages", I can say the 
> information you received wasn't correct (or complete). 
> 
> My challenge on this thread is about understanding where this might 
> unexpectedly bite users, which should be part of our due diligence when 
> applying such patches to stable branches.   I ask you to run through my four 
> points, which AFAIK still stand true.  
> 
> 
>> But I **don't** think it's healthy for us to repeatedly re-litigate whether 
>> data loss is acceptable based on how long it's been around, or how 
>> frequently some of us on the project have observed some given phenomenon. 
> 
> 
> Josh, that's true, but talking through these things helps open up the 
> discussion; see my point above about the second-hand evidence that turned out 
> to be inaccurate.
> 
> 
>> The severity and frequency of this issue combined with the business risk to 
>> Apache Cassandra users changed my mind about fixing it in earlier branches 
>> despite TCM having been merged to fix it for good on trunk.
>> 
> 
> 
> That shouldn't prevent us from investigating known edge-cases, collateral 
> damage, and unexpected behavioural changes in patch versions.   
> 
>  
> 
>  
>>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> 
>>> wrote:
>>>> 1. Rejecting writes does not prevent data loss in this situation.  It only 
>>>> reduces it.  The investigation and remediation of possible mislocated data 
>>>> is still required.
>>> All nodes which reject a write prevent mislocated data.  There is still the 
>>> possibility of some node having the same wrong view of the ring as the 
>>> coordinator (including if they are the same node) accepting data.  Unless 
>>> there are multiple nodes with the same wrong view of the ring, data loss is 
>>> prevented for CL > ONE.
> 
> 
> (1) stands true, for all CLs: a node sharing the coordinator's wrong view of 
> the ring (including the coordinator itself) can still persist misplaced data, 
> whatever the client is told.  I think this is pretty important here.
> 
> With write rejection enabled, we can tell people it may have prevented a lot 
> of data mislocation and is a great benefit to safety, but there's no 
> guarantee that it has prevented all data mislocation.  If an operator 
> encounters writes rejected in this manner, they must still go investigate a 
> possible data-loss situation.  
> 
> We are aware of our own situations where we have been hit by this, and they 
> come in a number of variants, but we can't speak to every situation users 
> will find themselves in.   We're making a trade-off here: reduced 
> availability in exchange for more forceful alerting and a reduction in data 
> mislocation.   
> 
> 
> 
>>> 
>>>> 2. Rejecting writes is a louder form of alerting for users unaware of the 
>>>> scenario, those not already monitoring logs or metrics.
>>> Without this patch no one is aware of any issues at all.  Maybe you are 
>>> referring to a situation where the patch is applied, but the default 
>>> behavior is to still accept the “bad” data?  In that case yes, turning on 
>>> rejection makes it “louder” in that your queries can fail if too many nodes 
>>> are wrong.
> 
> (2) stands true.    Rejecting is a louder alert, but it is not complete; see 
> the next point.  (All four points are made with the patch applied.)
> 
>  
>>> 
>>>> 3. Rejecting writes does not capture all places where the problem is 
>>>> occurring.  Only logging/metrics fully captures everywhere the problem is 
>>>> occurring.
>>> 
>>> Not sure what you are saying here.
> 
> Rejected writes can be swallowed by a coordinator sending background writes 
> to other nodes after it has already acked the response to the client.  If the 
> operator wants a complete and accurate overview of out-of-range writes, they 
> have to look at the logs/metrics, as sketched below.
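> 
> To illustrate, a minimal sketch (hypothetical names, not Cassandra's actual 
> internals) of why a late rejection is invisible to the client: once blockFor 
> acks arrive the coordinator answers the client, and any rejection arriving 
> after that point is only recorded in logs/metrics.
> 
>     // Hypothetical coordinator-side response handler; blockFor is e.g. 2
>     // for a QUORUM write at RF=3.
>     class BackgroundWriteSketch
>     {
>         final int blockFor;
>         int acks = 0;
>         boolean clientAcked = false;
>         int rejectedAfterAck = 0;   // surfaces only in logs/metrics
> 
>         BackgroundWriteSketch(int blockFor) { this.blockFor = blockFor; }
> 
>         synchronized void onReplicaResponse(boolean accepted)
>         {
>             if (accepted && ++acks >= blockFor && !clientAcked)
>                 clientAcked = true;       // client is told "success" here
>             else if (!accepted && clientAcked)
>                 rejectedAfterAck++;       // swallowed: the client never sees it
>         }
>     }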
> 
> (3) stands true.
> 
>  
>>> 
>>>> 4. … nodes can be rejecting writes when they are in fact correct hence 
>>>> causing “over-eager unavailability”.
>>> When would this occur?  I guess when the node with the bad ring information 
>>> is a replica being sent data from a coordinator with the correct ring state?  
>>> There would be no “unavailability” here unless there were multiple nodes in 
>>> such a state.  I also again would not call this over-eager, because the 
>>> node with the bad ring state is f’ed up and needs to be fixed.  So it being 
>>> considered unavailable doesn’t seem over-eager to me.
> 
> This fails a quorum write.  For example, with RF=3 at QUORUM, a single replica 
> wrongly rejecting a correct write leaves only two nodes able to ack it, so any 
> other slow or unavailable replica fails the request.  And the node need not be 
> f'ed up, just delayed in its view of the ring.
> 
> (4) stands true.
> 
> 
>>> 
>>> Given the fact that a user can read NEWS.txt and turn off this rejection of 
>>> writes, I see no reason not to err on the side of “the setting which gives 
>>> better protection even if it is not perfect”.  We should not let the want 
>>> to solve everything prevent incremental improvements, especially when we 
>>> actually do have the solution coming in TCM.
> 
> 
> Yes, this is what I'm aiming at: being truthful that this is a best-effort 
> approach to alerting on, and alleviating, some very serious scenarios.  It can 
> prevent data mislocation in some scenarios, but it offers no guarantees, and it 
> can also degrade availability unnecessarily.  Through production experience 
> from a number of large cluster operators we know it significantly and 
> importantly improves the consistency and durability of data, but ultimately an 
> operator finding themselves in this situation must still assume possible 
> eventual data loss and investigate accordingly.    Due diligence and 
> accuracy.  
> 
>  
