Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

Josh McKenzie Thu, 12 Sep 2024 12:55:51 -0700

I'd like to propose we treat all data loss bugs as "fix by default on all 
supported branches even if that might introduce user-facing changes".


Even if only N of M people on a thread have experienced it.
Even if we only uncover it through testing (looking at you Harry).

My gut tells me this is something we should have a clear cultural value system 
around as a project, and that value system should be "Above all else, we don't 
lose data". Just because users aren't aware it might be happening doesn't mean 
it's not a *massive* problem.

I would bet good money that there are *a lot* of user-felt pains using this 
project that we're all unfortunately insulated from. 

On Thu, Sep 12, 2024, at 3:35 PM, Mick Semb Wever wrote:
> Great that the discussion explores the issue as well.
> 
> So far we've heard of three* companies being impacted, four times in total…?
> Info is helpful here.
> 
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon, I'm
> assuming the company you refer to doesn't overlap.  JD, we know it had nothing
> to do with range movements and could/should have been prevented far more
> simply with operational correctness/checks.
> 
> In the extreme, when no writes have gone to any of the replicas, what
> happened?  Either this was CL.*ONE, or it was an operational failure (not C*
> at fault).  If it's an operational fault, both the coordinator and the node 
> can be wrong.  With CL.ONE, just the coordinator can be wrong and the problem 
> still exists (and with rejection enabled the operator is now more likely to 
> ignore it).
> 
> WRT the remedy, is it not either to run repair (when 1+ replica has it),
> or to load flushed and recompacted sstables (from the period in question) onto
> their correct nodes?  This is not difficult, but understandably costs sleep
> and is time-intensive.
> 
> I feel neither of the above two points is that material to the outcome, but
> I think it helps keep the discussion on track and informative.  We also know
> there are many competent operators out there that do detect data loss.
> 
> 
> 
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe <calebrackli...@gmail.com> 
> wrote:
>> If we don’t reject by default, but log by default, my fear is that we’ll 
>> simply be alerting the operator to something that has already gone very 
>> wrong that they may not be in any position to ever address.
>> 
>>> On Sep 12, 2024, at 12:44 PM, Jordan West <jw...@apache.org> wrote:
>>> 
>>> I’m +1 on enabling rejection by default on all branches. We have been bitten
>>> by silent data loss (due to other bugs like the schema issues in 4.1) from
>>> lack of rejection on several occasions, and short of writing extremely
>>> specialized tooling it’s unrecoverable. While both lack of availability and
>>> data loss are critical, I will always pick lack of availability over data
>>> loss. It’s better to fail a write that will be lost than silently lose it.
>>> 
>>> Of course, a change like this requires very good communication in NEWS.txt
>>> and elsewhere, but I think it’s well worth it. While it may surprise some
>>> users, I think they would be more surprised that they were silently losing
>>> data.
>>> 
>>> Jordan 
>>> 
>>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever <m...@apache.org> wrote:
>>>> Thanks for starting the thread Caleb, it is a big and impactful patch.
>>>> 
>>>> I appreciate the criticality; in a new major release, rejection by default
>>>> is obvious.  Otherwise, the logging and metrics are an important addition to
>>>> help users validate the existence and degree of any problem.
>>>> 
>>>> Also worth mentioning that rejecting writes can cause degraded
>>>> availability in situations that pose no problem.  This is a coordination
>>>> problem on a probabilistic design; it's choose your evil: unnecessary
>>>> degraded availability or mislocated data (eventual data loss).  Logging
>>>> and metrics make alerting on and handling the data mislocation possible,
>>>> i.e. they avoid data loss with manual intervention.  (Logging and metrics
>>>> also face the same problem with false positives.)
>>>> 
>>>> I'm +0 for rejection by default in 5.0.1, and +1 for logging only by
>>>> default in 4.x
>>>> 
>>>> 
>>>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>> This patch is so hard for me. 
>>>>> 
>>>>> The safety it adds is critical and should have been added a decade ago.
>>>>> Also it’s a huge patch, and touches “everything”. 
>>>>> 
>>>>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1.  
>>>>> 
>>>>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data 
>>>>> loss (which it implicitly is), I guess?
>>>>> 
>>>>> 
>>>>> 
>>>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>>> > 
>>>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>>>>> > <calebrackli...@gmail.com> wrote:
>>>>> >> 
>>>>> >> Are you opposed to the patch in its entirety, or just rejecting unsafe 
>>>>> >> operations by default?
>>>>> > 
>>>>> > I had the latter in mind.  Changing any default in a patch release is
>>>>> > a potential surprise for operators and one of this nature especially
>>>>> > so.
>>>>> > 
>>>>> > Kind Regards,
>>>>> > Brandon
