replies below (to Scott, Josh and Jeremiah).

tl;dr: all four of my points stand undisputed when the patch is applied.
This is a messy situation, and there's no denying the value of rejecting
writes in a number of known and common scenarios, but point (1) remains
important to highlight IMHO.


On Fri, 13 Sept 2024 at 03:03, C. Scott Andreas <sc...@paradoxica.net>
wrote:

> Since that time, I’ve received several urgent messages from major users of
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking
> about this bug. Some were able to verify the presence of lost data in
> SSTables on nodes where it didn’t belong, demonstrate empty read responses
> for data that is known proof-positive to exist (think content-addressable
> stores), or reproduce this behavior in a local cluster after forcing
> disagreement.
>



Having been privy to the background of those "urgent messages", I can say
the information you received wasn't correct (or complete).

My challenge on this thread is about understanding where this might
unexpectedly bite users, which should be part of our due diligence when
applying such patches to stable branches.   I ask you to run through my
four points, which AFAIK still stand true.


But I *don't* think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it's been around, or how
> frequently some of us on the project have observed some given phenomenon.


Josh, that's true, but talking through these things helps open up the
discussion; see my point above wrt second-hand evidence that turned out to
be inaccurate.


The severity and frequency of this issue combined with the business risk to
> Apache Cassandra users changed my mind about fixing it in earlier branches
> despite TCM having been merged to fix it for good on trunk.
>


That shouldn't prevent us from investigating known edge-cases, collateral
damage, and unexpected behavioural changes in patch versions.





> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
>> reduces it.  The investigation and remediation of possible mislocated data
>> is still required.
>>
>
> All nodes which reject a write prevent mislocated data.  There is still
> the possibility of some node having the same wrong view of the ring as the
> coordinator (including if they are the same node) accepting data.  Unless
> there are multiple nodes with the same wrong view of the ring, data loss is
> prevented for CL > ONE.
>
>

(1) stands true, for all CLs.  I think this is pretty important here.

With write rejection enabled, we can tell people it may have prevented a
lot of data mislocation, which is of real benefit to safety, but there's no
guarantee that it has prevented all of it.  If an operator encounters
writes rejected in this manner, they must still go and investigate a
possible data loss situation.

We are aware of our own situations where we have been hit by this, and they
come in a number of variants, but we can't speak to every situation users
will find themselves in.  We're making a trade-off here: reduced
availability in exchange for more forceful alerting and a reduction in data
mislocation.
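
To make that failure mode concrete, here's a minimal, hypothetical sketch
(plain Java, not Cassandra code; the node names and the single-token "ring
views" are invented purely for illustration).  A replica can only reject a
write based on its own view of the ring, so if the coordinator and the
chosen replica share the same stale view, the write is accepted and
mislocated even with rejection enabled:

    // Hypothetical, simplified sketch -- not Cassandra code.
    import java.util.Map;

    public class StaleViewSketch {
        // token -> owning node, according to each view of the ring
        static final Map<Integer, String> CORRECT_VIEW = Map.of(100, "nodeA");
        static final Map<Integer, String> STALE_VIEW   = Map.of(100, "nodeB");

        // a replica rejects a write only if *its own* view says it isn't the owner
        static boolean accepts(String replica, Map<Integer, String> replicaView, int token) {
            return replica.equals(replicaView.get(token));
        }

        public static void main(String[] args) {
            int token = 100;
            // a coordinator holding the stale view routes the write to nodeB...
            String target = STALE_VIEW.get(token);
            // ...and if nodeB also holds the stale view, it accepts: mislocated data
            System.out.println(accepts(target, STALE_VIEW, token));   // true
            // only once nodeB has learned the correct view does it reject
            System.out.println(accepts(target, CORRECT_VIEW, token)); // false
        }
    }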


2. Rejecting writes is a louder form of alerting for users unaware of the
>> scenario, those not already monitoring logs or metrics.
>>
>
> Without this patch no one is aware of any issues at all.  Maybe you are
> referring to a situation where the patch is applied, but the default
> behavior is to still accept the “bad” data?  In that case yes, turning on
> rejection makes it “louder” in that your queries can fail if too many nodes
> are wrong.
>
>
(2) stands true.  Rejecting is a louder alert, but it is not complete; see
the next point.  (All four points are made with the patch applied.)



> 3. Rejecting writes does not capture all places where the problem is
>> occurring.  Only logging/metrics fully captures everywhere the problem is
>> occurring.
>>
>
> Not sure what you are saying here.
>
>
Rejected writes can be swallowed when a coordinator sends background writes
to other nodes after it has already ack'd the response to the client.  If
the operator wants a complete and accurate overview of out-of-range writes,
they have to look at the logs/metrics.
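
As a rough illustration of the mechanism (again a hypothetical plain-Java
sketch, not Cassandra code; the Response enum and the arrival order are my
own invention): with RF=3 at QUORUM the coordinator acks the client after
the first two replica responses, so a rejection arriving later in the
background is only ever visible to the operator, never to the client:

    // Hypothetical sketch -- not Cassandra code.  Shows how a late rejection
    // can be invisible to the client once the coordinator has ack'd at CL.
    import java.util.List;

    public class SwallowedRejectionSketch {
        enum Response { ACK, REJECTED }

        public static void main(String[] args) {
            int quorum = 2; // RF=3, CL=QUORUM
            // responses in arrival order: two fast acks, then a late rejection
            List<Response> arrivals = List.of(Response.ACK, Response.ACK, Response.REJECTED);

            int acks = 0;
            boolean clientAcked = false;
            for (Response r : arrivals) {
                if (r == Response.ACK)
                    acks++;
                if (!clientAcked && acks >= quorum) {
                    clientAcked = true;
                    System.out.println("client sees: success");
                } else if (clientAcked && r == Response.REJECTED) {
                    // surfaced only via logs/metrics; the client never sees it
                    System.out.println("operator sees: out-of-range write rejected");
                }
            }
        }
    }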

(3) stands true.



> 4. … nodes can be rejecting writes when they are in fact correct hence
>> causing “over-eager unavailability”.
>>
>
> When would this occur?  I guess when the node with the bad ring
> information is a replica sent data from a coordinator with the correct ring
> state?  There would be no “unavailability” here unless there were multiple
> nodes in such a state.  I also again would not call this over eager,
> because the node with the bad ring state is f’ed up and needs to be fixed.
> So if being considered unavailable doesn’t seem over-eager to me.
>
>
This can fail a quorum write.  And the node need not be f'ed up, just
delayed in its view of the ring.
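
One way this can play out, as a hypothetical worked example (plain Java
sketch, not Cassandra code; the replica names and the "one replica down"
condition are my own illustration): RF=3 at QUORUM needs two acks.  Replica
A acks, replica B happens to be down for unrelated reasons, and replica C,
whose ring view is merely a little behind after a range movement, rejects a
write it genuinely owns.  The write fails even though nothing would have
been misplaced:

    // Hypothetical sketch -- not Cassandra code.  One down replica plus one
    // replica rejecting on a merely-delayed ring view is enough to fail QUORUM.
    public class DelayedViewSketch {
        public static void main(String[] args) {
            int quorum = 2;        // RF=3, CL=QUORUM
            boolean aAcks = true;  // healthy, up-to-date view
            boolean bAcks = false; // down / unreachable
            boolean cAcks = false; // up, but rejects on a delayed ring view

            int acks = (aAcks ? 1 : 0) + (bAcks ? 1 : 0) + (cAcks ? 1 : 0);
            System.out.println(acks >= quorum
                    ? "write succeeds"
                    : "write fails: unavailable despite correct placement");
        }
    }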

(4) stands true.


Given the fact that a user can read NEWS.txt and turn off this rejection of
> writes, I see no reason not to err on the side of “the setting which gives
> better protection even if it is not perfect”.  We should not let the want
> to solve everything prevent incremental improvements, especially when we
> actually do have the solution coming in TCM.
>
>

Yes, this is what I'm aiming at: being truthful that this is a best-effort
measure for alerting on, and alleviating, some very serious scenarios.  It
can prevent data mislocation in many scenarios, but it offers no guarantee
of that, and it can also degrade availability unnecessarily.  Through
production experience from a number of large-cluster operators we know it
significantly and importantly improves the consistency and durability of
data, but ultimately an operator who finds themselves in this situation
must still assume data loss is possible and investigate accordingly.  Due
diligence and accuracy.
