I'm less concerned with what the defaults are in each branch, and
more with the accuracy of what we say, e.g. in NEWS.txt.
This is my understanding so far, and where I hoped to be corrected.
1. Rejecting writes does not prevent data loss in this situation.
It only reduces it. The investigation and remediation of possible
mislocated data is still required.
2. Rejecting writes is a louder form of alerting for users unaware
of the scenario, those not already monitoring logs or metrics.
3. Rejecting writes does not capture all places where the problem
is occurring. Only logging/metrics fully captures everywhere the
problem is occurring.
4. This situation can be a consequence of other problems (C* or
operational), not only range movements and the nature of gossip.
(2) is the primary argument I see for making rejection the
default. We need to inform the user that data mislocation can
still be happening, and that the only way to fully capture it is
by monitoring the enabled logging/metrics. We can also provide
information about when range movements can cause this, and note
that nodes can reject writes when they are in fact correct, hence
causing "over-eager unavailability". And furthermore, point people
to TCM.
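For illustration, NEWS.txt could show the two modes side by side.
A sketch, assuming option names along these lines (to be corrected
to whatever the patch actually ships in cassandra.yaml):

    # Option names here are my assumption, not verified against the patch.
    log_out_of_token_range_requests: true      # logging/metrics, always useful
    reject_out_of_token_range_requests: false  # proposed 4.x default: log only
    # reject_out_of_token_range_requests: true # proposed 5.x default: reject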
On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan
<jeremiah.jor...@gmail.com> wrote:
> JD we know it had nothing to do with range movements and
> could/should have been prevented far simpler with operational
> correctness/checks.
"Be better" is not the answer. Also I think you are confusing
our incidents; the out-of-range token issue we saw was not because
of an operational "oops" that could have been avoided.
> In the extreme, when no writes have gone to any of the
> replicas, what happened? Either this was CL.ONE, or it was an
> operational failure (not C* at fault). If it's an operational
> fault, both the coordinator and the node can be wrong. With
> CL.ONE, just the coordinator can be wrong and the problem still
> exists (and with rejection enabled the operator is now more
> likely to ignore it).
If some node has a bad ring state, it can easily send no writes
to the correct place; no need for CL ONE. With the current system
behavior, CL ALL will be successful, with all the nodes sent a
mutation happily accepting and acking data they do not own.
Yes, even with this patch, if you are using CL ONE and the
coordinator has a faulty ring state where no replica is "real",
and it also decides that it is one of the replicas, then you will
have a successful write even though no correct node got the data.
If you are using CL ONE you already know you are taking on a risk.
Not great, but at the least there should be evidence on other
nodes of the bad thing occurring. Also, for this same ring state,
with the patch the write would fail for any CL > ONE (assuming
only a single node has the bad ring state).
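To make the mechanics concrete, here is a rough sketch of the
replica-side check (all names are hypothetical stand-ins, not the
actual patch code). Note that a coordinator which wrongly counts
itself as a replica consults the same faulty ring view for its own
local write, so this check never fires there; that is the CL ONE
gap above:

    import java.util.Set;

    // Hypothetical stand-ins for Cassandra internals; sketch only.
    interface Node {}
    interface Mutation { long token(); void applyLocally(); }
    interface RingView { Set<Node> naturalReplicas(long token); }
    class OutOfRangeMutationException extends RuntimeException {}

    class ReplicaWriteHandler {
        private final RingView localRingView; // this node's (possibly stale) ownership view
        private final Node self;
        private final boolean rejectOutOfRange; // the default this thread debates

        ReplicaWriteHandler(RingView ring, Node self, boolean reject) {
            this.localRingView = ring;
            this.self = self;
            this.rejectOutOfRange = reject;
        }

        void receive(Mutation mutation) {
            if (!localRingView.naturalReplicas(mutation.token()).contains(self)) {
                // Logging/metrics always fire, so every occurrence is observable.
                System.err.println("out-of-range mutation for token " + mutation.token());
                if (rejectOutOfRange)
                    throw new OutOfRangeMutationException(); // NACK: coordinator sees a failure
            }
            // Without rejection, a node with a stale ring view falls through
            // and acks data it does not own: the silent-loss scenario.
            mutation.applyLocally();
        }
    }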
> Even when the fix is only partial, so really it's more about
> more forcefully alerting the operator to the problem via
> over-eager unavailability …?
Not sure why you are calling this "over-eager unavailability".
If the data is going to the wrong nodes then the nodes may as
well be down. Unless the end user is writing at CL ANY, they have
requested to be ACKed only when CL nodes that own the data have
acked getting it.
-Jeremiah
On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever <m...@apache.org>
wrote:
Great that the discussion explores the issue as well.
So far we've heard of three* companies being impacted, and four
times in total…? Info is helpful here.
*) Jordan, you say you've been hit by _other_ bugs _like_ it.
Jon, I'm assuming the company you refer to doesn't overlap. JD,
we know it had nothing to do with range movements and
could/should have been prevented far simpler with operational
correctness/checks.
In the extreme, when no writes have gone to any of the replicas,
what happened? Either this was CL.ONE, or it was an operational
failure (not C* at fault). If it's an operational fault, both the
coordinator and the node can be wrong. With CL.ONE, just the
coordinator can be wrong and the problem still exists (and with
rejection enabled the operator is now more likely to ignore it).
WRT the remedy, is it not to either run repair (when 1+ replica
has the data), or to load the flushed and recompacted sstables
(from the period in question) onto their correct nodes? This is
not difficult, but it is understandably time-intensive and costs
lost sleep.
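As a rough sketch of that remediation, with placeholder
keyspace/table/host names:

    # When at least one correct replica has the data:
    nodetool repair -full my_keyspace my_table

    # Re-stream sstables copied off the wrong node to the correct
    # replicas (the directory must end in <keyspace>/<table>):
    sstableloader -d correct_node1,correct_node2 /backup/my_keyspace/my_table/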
Neither of the above two points, I feel, is that material to the
outcome, but I think they help keep the discussion on track and
informative. We also know there are many competent operators out
there that do detect data loss.
On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe
<calebrackli...@gmail.com> wrote:
If we don't reject by default, but log by default, my fear is
that we'll simply be alerting the operator to something that has
already gone very wrong, and that they may not be in any position
to ever address.
On Sep 12, 2024, at 12:44 PM, Jordan West
<jw...@apache.org> wrote:
I'm +1 on enabling rejection by default on all branches. We have
been bitten by silent data loss (due to other bugs, like the
schema issues in 4.1) from lack of rejection on several occasions,
and short of writing extremely specialized tooling it's
unrecoverable. While both lack of availability and data loss are
critical, I will always pick lack of availability over data loss.
It's better to fail a write that will be lost than to silently
lose it.
Of course, a change like this requires very good communication in
NEWS.txt and elsewhere, but I think it's well worth it. While it
may surprise some users, I think they would be more surprised
that they were silently losing data.
Jordan
On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever
<m...@apache.org> wrote:
Thanks for starting the thread, Caleb. It is a big and impactful
patch.
Appreciate the criticality; in a new major release rejection by
default is obvious. Otherwise the logging and metrics are an
important addition to help users validate the existence and
degree of any problem.
Also worth mentioning that rejecting writes can cause degraded
availability in situations that pose no problem. This is a
coordination problem on a probabilistic design; it's choose your
evil: unnecessary degraded availability, or mislocated data
(eventual data loss). Logging and metrics make alerting on and
handling the data mislocation possible, i.e. they avoid data loss
with manual intervention. (Logging and metrics also face the
same problem with false positives.)
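For example, alerting can be as simple as a Prometheus-style rule
like the following (the metric name is hypothetical; substitute
whatever the patch actually exposes):

    - alert: MislocatedWrites
      expr: increase(cassandra_out_of_range_mutations_total[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: "Node received writes for tokens it does not own"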
I'm +0 for rejection by default in 5.0.1, and +1 for logging-only
by default in 4.x.
On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa
<jji...@gmail.com> wrote:
This patch is so hard for me.
The safety it adds is critical and should have
been added a decade ago.
Also it’s a huge patch, and touches “everything”.
It definitely belongs in 5.0. I’d probably reject
by default in 5.0.1.
4.0 / 4.1 - if we treat this like a fix for a latent opportunity
for data loss (which it implicitly is), I guess?
> On Sep 12, 2024, at 9:46 AM, Brandon Williams
<dri...@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> <calebrackli...@gmail.com> wrote:
>>
>> Are you opposed to the patch in its entirety, or just
>> rejecting unsafe operations by default?
>
> I had the latter in mind. Changing any default in a patch
> release is a potential surprise for operators, and one of this
> nature especially so.
>
> Kind Regards,
> Brandon