I'd like to propose we treat all data loss bugs as "fix by default on all supported branches even if that might introduce user-facing changes".
Even if only N of M people on a thread have experienced it. Even if we only uncover it through testing (looking at you Harry). My gut tells me this is something we should have a clear cultural value system around as a project, and that value system should be "Above all else, we don't lose data". Just because users aren't aware it might be happening doesn't mean it's not a *massive* problem. I would bet good money that there are *a lot* of user-felt pains using this project that we're all unfortunately insulated from. On Thu, Sep 12, 2024, at 3:35 PM, Mick Semb Wever wrote: > Great that the discussion explores the issue as well. > > So far we've heard three* companies being impacted, and four times in total…? > Info is helpful here. > > *) Jordan, you say you've been hit by _other_ bugs _like_ it. Jon i'm > assuming the company you refer to doesn't overlap. JD we know it had nothing > to do with range movements and could/should have been prevented far simpler > with operational correctness/checks. > > In the extreme, when no writes have gone to any of the replicas, what > happened ? Either this was CL.*ONE, or it was an operational failure (not C* > at fault). If it's an operational fault, both the coordinator and the node > can be wrong. With CL.ONE, just the coordinator can be wrong and the problem > still exists (and with rejection enabled the operator is now more likely to > ignore it). > > WRT to the remedy, is it not to either run repair (when 1+ replica has it), > or to load flushed and recompacted sstables (from the period in question) to > their correct nodes. This is not difficult, but understandably lost-sleep > and time-intensive. > > Neither of the above two points I feel are that material to the outcome, but > I think it helps keep the discussion on track and informative. We also know > there are many competent operators out there that do detect data loss. > > > > On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe <calebrackli...@gmail.com> > wrote: >> If we don’t reject by default, but log by default, my fear is that we’ll >> simply be alerting the operator to something that has already gone very >> wrong that they may not be in any position to ever address. >> >>> On Sep 12, 2024, at 12:44 PM, Jordan West <jw...@apache.org> wrote: >>> >>> I’m +1 on enabling rejection by default on all branches. We have been bit >>> by silent data loss (due to other bugs like the schema issues in 4.1) from >>> lack of rejection on several occasions and short of writing extremely >>> specialized tooling its unrecoverable. While both lack of availability and >>> data loss are critical, I will always pick lack of availability over data >>> loss. Its better to fail a write that will be lost than silently lose it. >>> >>> Of course, a change like this requires very good communication in NEWS.txt >>> and elsewhere but I think its well worth it. While it may surprise some >>> users I think they would be more surprised that they were silently losing >>> data. >>> >>> Jordan >>> >>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever <m...@apache.org> wrote: >>>> Thanks for starting the thread Caleb, it is a big and impacting patch. >>>> >>>> Appreciate the criticality, in a new major release rejection by default is >>>> obvious. Otherwise the logging and metrics is an important addition to >>>> help users validate the existence and degree of any problem. >>>> >>>> Also worth mentioning that rejecting writes can cause degraded >>>> availability in situations that pose no problem. This is a coordination >>>> problem on a probabilistic design, it's choose your evil: unnecessary >>>> degraded availability or mislocated data (eventual data loss). Logging >>>> and metrics makes alerting on and handling the data mislocation possible, >>>> i.e. avoids data loss with manual intervention. (Logging and metrics also >>>> face the same problem with false positives.) >>>> >>>> I'm +0 for rejection default in 5.0.1, and +1 for only logging default in >>>> 4.x >>>> >>>> >>>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa <jji...@gmail.com> wrote: >>>>> This patch is so hard for me. >>>>> >>>>> The safety it adds is critical and should have been added a decade ago. >>>>> Also it’s a huge patch, and touches “everything”. >>>>> >>>>> It definitely belongs in 5.0. I’d probably reject by default in 5.0.1. >>>>> >>>>> 4.0 / 4.1 - if we treat this like a fix for latent opportunity for data >>>>> loss (which it implicitly is), I guess? >>>>> >>>>> >>>>> >>>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams <dri...@gmail.com> wrote: >>>>> > >>>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe >>>>> > <calebrackli...@gmail.com> wrote: >>>>> >> >>>>> >> Are you opposed to the patch in its entirety, or just rejecting unsafe >>>>> >> operations by default? >>>>> > >>>>> > I had the latter in mind. Changing any default in a patch release is >>>>> > a potential surprise for operators and one of this nature especially >>>>> > so. >>>>> > >>>>> > Kind Regards, >>>>> > Brandon