Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

Berenguer Blasi Fri, 13 Sep 2024 00:55:51 -0700

+1 to rejecting on all branches. Yes fixing bugs and problems change howthings used to worked and some users will be surprised. But it's betterthan being surprised on an eventual data loss.


On 13/9/24 3:34, Josh McKenzie wrote:

Even when the fix is only partial, so really it's more about moreforcefully alerting the operator to the problem via over-eagerunavailability …?
Sometimes a principled stance can take us away from the importantdetails in the discussions.
My understanding of the ticket (having not dug deeply into the code,just reviewed the JIRA and this thread), is that this is as effectiveof a solution as we can have in a non-deterministic, non-epoch based,non-transactional metadata system. i.e. Gossip. I don't see it as apartial fix but I might be misunderstanding.
I'm not advocating for us having a rigid principled stance where wereject all nuance and don't discuss things. I'm advocating for uscoalescing on a shared */default/* stance of correctness unlessotherwise excepted. We know we're a diverse group, we're all differentpeople with different histories / values / opinions / cultures, and Ithink that's what makes this community as effective as it is.
But I /*don't*/ think it's healthy for us to repeatedly re-litigatewhether data loss is acceptable based on how long it's been around, orhow frequently some of us on the project have observed some givenphenomenon. My gut tells me we'd all be in a better place if we allstarted from 0 on a discussion like this as "Ok, data loss isunacceptable. Unless otherwise warranted, we should do all we can tofix this on all supported branches as our default response".
On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
Thanks all for discussion on this.
It’s hard to describe the sinking feeling that hit me when it becameclear to me how common this problem is - and how horribly difficultit is to prove one has encountered this bug.
Two years ago, my understanding was that this is an exceptionallyrare and transient issue unlikely to occur after all the work we putinto gossip. My view was that gossip had basically been shorn up andthat Transactional Metadata is the proper fix for this with its epochdesign (which is true).
Since that time, I’ve received several urgent messages from majorusers of Apache Cassandra and even customers of Cassandra ecosystemvendors asking about this bug. Some were able to verify the presenceof lost data in SSTables on nodes where it didn’t belong, demonstrateempty read responses for data that is known proof-positive to exist(think content-addressable stores), or reproduce this behavior in alocal cluster after forcing disagreement.
The severity and frequency of this issue combined with the businessrisk to Apache Cassandra users changed my mind about fixing it inearlier branches despite TCM having been merged to fix it for good ontrunk.
The guards in this patch are extensive: point reads, range reads,mutations, repair, incoming / outgoing streams, hints, merkle treerequests, and others I’m forgetting. They’re simple guards, and whilethey touch many subsystems, they’re not invasive changes.
There is no reasonable scenario that’s common enough that wouldjustify disabling a guard preventing silent data loss by default. Iappreciate that a prop exists to permit or warn in the presence ofdata loss for anyone who may want that, in the spirit of users beingin control of their clusters’ behavior.
Very large operators may only see indications the guard took effectfor a handful of queries per day — but in instances where ownershipdisagreement is prolonged, the patch is an essential guard againstlarge-scale unrecoverable data loss and incorrect responses toqueries. I’ll further take the position that those few queries intransient disagreement scenarios would be justification by themselves.
I support merging the patch to all proposed branches and enabling theguard by default.
– Scott
On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan<jeremiah.jor...@gmail.com> wrote:
1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possiblemislocated data is still required.
All nodes which reject a write prevent mislocated data. There isstill the possibility of some node having the same wrong view of thering as the coordinator (including if they are the same node)accepting data. Unless there are multiple nodes with the same wrongview of the ring, data loss is prevented for CL > ONE.
2. Rejecting writes is a louder form of alerting for users unawareof the scenario, those not already monitoring logs or metrics.
Without this patch no one is aware of any issues at all. Maybe youare referring to a situation where the patch is applied, but thedefault behavior is to still accept the “bad” data? In that caseyes, turning on rejection makes it “louder” in that your queries canfail if too many nodes are wrong.
3. Rejecting writes does not capture all places where the problemis occurring. Only logging/metrics fully captures everywhere theproblem is occurring.
Not sure what you are saying here.
nodes can be rejecting writes when they are in fact correct hencecausing “over-eager unavailability”.
When would this occur? I guess when the node with the bad ringinformation is a replica sent data from a coordinator with thecorrect ring state? There would be no “unavailability” here unlessthere were multiple nodes in such a state. I also again would notcall this over eager, because the node with the bad ring state isf’ed up and needs to be fixed. So if being considered unavailabledoesn’t seem over-eager to me.
Given the fact that a user can read NEWS.txt and turn off thisrejection of writes, I see no reason not to err on the side of “thesetting which gives better protection even if it is not perfect”. We should not let the want to solve everything prevent incrementalimprovements, especially when we actually do have the solutioncoming in TCM.
-Jeremiah

On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever <m...@apache.org> wrote:
I'm less concerned with what the defaults are in each branch, andmore the accuracy of what we say, e.g. in NEWS.txt
This is my understanding so far, and where I hoped to be corrected.
1. Rejecting writes does not prevent data loss in this situation. It only reduces it. The investigation and remediation of possiblemislocated data is still required.
2. Rejecting writes is a louder form of alerting for users unawareof the scenario, those not already monitoring logs or metrics.
3. Rejecting writes does not capture all places where the problemis occurring. Only logging/metrics fully captures everywhere theproblem is occurring.
4. This situation can be a consequence of other problems (C* oroperational), not only range movements and the nature of gossip.
(2) is the primary argument I see for setting rejection todefault. We need to inform the user that data mislocation canstill be happening, and the only way to fully capture it is viamonitoring of enabled logging/metrics. We can also provideinformation about when range movements can cause this, and thatnodes can be rejecting writes when they are in fact correct hencecausing “over-eager unavailability”. And furthermore, point peopleto TCM.
On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan<jeremiah.jor...@gmail.com> wrote:
    JD we know it had nothing to do with range movements and
    could/should have been prevented far simpler with operational
    correctness/checks.
    “Be better” is not the answer.  Also I think you are confusing
    our incidents, the out of range token issue we saw was not
    because of an operational “oops” that could have been avoided.
    In the extreme, when no writes have gone to any of the
    replicas, what happened ? Either this was CL.*ONE, or it was
    an operational failure (not C* at fault).  If it's an
    operational fault, both the coordinator and the node can be
    wrong.  With CL.ONE, just the coordinator can be wrong and the
    problem still exists (and with rejection enabled the operator
    is now more likely to ignore it).
    If some node has a bad ring state it can easily send no writes
    to the correct place, no need for CL ONE, with the current
    system behavior CL ALL will be successful, with all the nodes
    sent a mutation happily accepting and acking data they do not own.

    Yes, even with this patch if you are using CL ONE, if the
    coordinator has a faulty ring state where no replica is “real”
    and it also decides that it is one of the replicas, then you
    will have a successful write, even though no correct node got
    the data.  If you are using CL ONE you already know you are
    taking on a risk.  Not great, but there should be evidence in
    other nodes of the bad thing occurring at the least.  Also for
    this same ring state, for any CL > ONE with the patch the write
    would fail (assuming only a single node has the bad ring state).
    Even when the fix is only partial, so really it's more about
    more forcefully alerting the operator to the problem via
    over-eager unavailability …?
    Not sure why you are calling this “over-eager unavailability”. 
    If the data is going to the wrong nodes then the nodes may as
    well be down.  Unless the end user is writing at CL ANY they
    have requested to be ACKed when CL nodes which own the data
    have acked getting it.

    -Jeremiah

    On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever <m...@apache.org>
    wrote:
    Great that the discussion explores the issue as well.

    So far we've heard three* companies being impacted, and four
    times in total…?  Info is helpful here.

    *) Jordan, you say you've been hit by _other_ bugs _like_ it. 
    Jon i'm assuming the company you refer to doesn't overlap. JD
    we know it had nothing to do with range movements and
    could/should have been prevented far simpler with operational
    correctness/checks.

    In the extreme, when no writes have gone to any of the
    replicas, what happened ? Either this was CL.*ONE, or it was
    an operational failure (not C* at fault).  If it's an
    operational fault, both the coordinator and the node can be
    wrong.  With CL.ONE, just the coordinator can be wrong and the
    problem still exists (and with rejection enabled the operator
    is now more likely to ignore it).

    WRT to the remedy, is it not to either run repair (when 1+
    replica has it), or to load flushed and recompacted sstables
    (from the period in question) to their correct nodes.  This is
    not difficult, but understandably lost-sleep and time-intensive.

    Neither of the above two points I feel are that material to
    the outcome, but I think it helps keep the discussion on track
    and informative.   We also know there are many competent
    operators out there that do detect data loss.



    On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe
    <calebrackli...@gmail.com> wrote:

        If we don’t reject by default, but log by default, my fear
        is that we’ll simply be alerting the operator to something
        that has already gone very wrong that they may not be in
        any position to ever address.
        On Sep 12, 2024, at 12:44 PM, Jordan West
        <jw...@apache.org> wrote:
        
        I’m +1 on enabling rejection by default on all branches.
        We have been bit by silent data loss (due to other bugs
        like the schema issues in 4.1) from lack of rejection on
        several occasions and short of writing extremely
        specialized tooling its unrecoverable. While both lack of
        availability and data loss are critical, I will always
        pick lack of availability over data loss. Its better to
        fail a write that will be lost than silently lose it.

        Of course, a change like this requires very good
        communication in NEWS.txt and elsewhere but I think its
        well worth it. While it may surprise some users I think
        they would be more surprised that they were silently
        losing data.

        Jordan

        On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever
        <m...@apache.org> wrote:

            Thanks for starting the thread Caleb, it is a big and
            impacting patch.

            Appreciate the criticality, in a new major release
            rejection by default is obvious. Otherwise the
            logging and metrics is an important addition to help
            users validate the existence and degree of any problem.

            Also worth mentioning that rejecting writes can cause
            degraded availability in situations that pose no
            problem. This is a coordination problem on a
            probabilistic design, it's choose your evil:
            unnecessary degraded availability or mislocated data
            (eventual data loss).   Logging and metrics makes
            alerting on and handling the data mislocation
            possible, i.e. avoids data loss with manual
            intervention.  (Logging and metrics also face the
            same problem with false positives.)

            I'm +0 for rejection default in 5.0.1, and +1 for
            only logging default in 4.x


            On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa
            <jji...@gmail.com> wrote:

                This patch is so hard for me.

                The safety it adds is critical and should have
                been added a decade ago.
                Also it’s a huge patch, and touches “everything”.

                It definitely belongs in 5.0. I’d probably reject
                by default in 5.0.1.

                4.0 / 4.1 - if we treat this like a fix for
                latent opportunity for data loss (which it
                implicitly is), I guess?



                > On Sep 12, 2024, at 9:46 AM, Brandon Williams
                <dri...@gmail.com> wrote:
                >
                > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
                > <calebrackli...@gmail.com> wrote:
                >>
                >> Are you opposed to the patch in its entirety,
                or just rejecting unsafe operations by default?
                >
                > I had the latter in mind.  Changing any default
                in a patch release is
                > a potential surprise for operators and one of
                this nature especially
                > so.
                >
                > Kind Regards,
                > Brandon

Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

Reply via email to