Re: [DISCUSS] CASSANDRA-13704 Safer handling of out of range tokens

guo Maxwell Thu, 12 Sep 2024 19:51:28 -0700

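+1 to enabling rejection by default on all branches. We have done something
similar to address this problem. The default behavior may change, but what
we are fixing is the correctness of data storage, which I think is the most
important thing for a database, so other concerns may matter less.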


On Fri, Sep 13, 2024 at 09:34, Josh McKenzie <jmcken...@apache.org> wrote:

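> Even when the fix is only partial, so really it's more about more
> forcefully alerting the operator to the problem via over-eager
> unavailability …?
>
> Sometimes a principled stance can take us away from the important details
> in a discussion.
>
> My understanding of this ticket (without having dug deeply into the code,
> just looking at the JIRA and this thread) is that this is the most
> effective solution we can find in a non-deterministic, non-epoch-based,
> non-transactional metadata system, i.e. Gossip. I don't see this as a
> partial fix, but I may be misunderstanding.
>
> I'm not advocating that we take a rigid principled stance that rejects all
> nuance and discusses nothing. I'm advocating that we align on a shared
> *default* stance on correctness, barring exceptions. We know we're a
> diverse group, we're all different people with different histories /
> values / opinions / cultures, and I think that's what makes this community
> so effective.
>
> But I *don't* think it's healthy for us to repeatedly re-litigate whether
> data loss is acceptable based on how long it has been occurring or how
> frequently some people on the project have observed it. My gut tells me
> we'd all be in a better place if we started every discussion from 0 with
> something like "Ok, data loss is unacceptable. Unless otherwise warranted,
> we should do everything we can to fix this on all supported branches as
> our default response".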
>
> On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
>
> Thanks all for the discussion on this.
>
>
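> It's hard for me to describe the sinking feeling I had when I realized how
> prevalent this issue is, and how difficult it is to prove that one has hit
> this bug.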
>
>
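> Two years ago, my understanding was that this was an extremely rare and
> transient issue that was unlikely to occur after all the work we had put
> into Gossip. My view was that Gossip had basically been solved, and that
> transactional metadata is the proper way to address this through its epoch
> design (which is true).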
>
>
> Since that time, I’ve received several urgent messages from major users of
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking
> about this bug. Some were able to verify the presence of lost data in
> SSTables on nodes where it didn’t belong, demonstrate empty read responses
> for data that is known proof-positive to exist (think content-addressable
> stores), or reproduce this behavior in a local cluster after forcing
> disagreement.
>
>
> The severity and frequency of this issue combined with the business risk
> to Apache Cassandra users changed my mind about fixing it in earlier
> branches despite TCM having been merged to fix it for good on trunk.
>
>
> The guards in this patch are extensive: point reads, range reads,
> mutations, repair, incoming / outgoing streams, hints, merkle tree
> requests, and others I’m forgetting. They’re simple guards, and while they
> touch many subsystems, they’re not invasive changes.
>
>
> There is no reasonable scenario that’s common enough that would justify
> disabling a guard preventing silent data loss by default. I appreciate that
> a prop exists to permit or warn in the presence of data loss for anyone who
> may want that, in the spirit of users being in control of their clusters’
> behavior.
>
>
> Very large operators may only see indications the guard took effect for a
> handful of queries per day — but in instances where ownership disagreement
> is prolonged, the patch is an essential guard against large-scale
> unrecoverable data loss and incorrect responses to queries. I’ll further
> take the position that those few queries in transient disagreement
> scenarios would be justification by themselves.
>
>
> I support merging the patch to all proposed branches and enabling the
> guard by default.
>
>
> – Scott
>
> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
> 
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
> reduces it.  The investigation and remediation of possible mislocated data
> is still required.
>
>
> All nodes which reject a write prevent mislocated data.  There is still
> the possibility of some node having the same wrong view of the ring as the
> coordinator (including if they are the same node) accepting data.  Unless
> there are multiple nodes with the same wrong view of the ring, data loss is
> prevented for CL > ONE.
>
> 2. Rejecting writes is a louder form of alerting for users unaware of the
> scenario, those not already monitoring logs or metrics.
>
>
> Without this patch no one is aware of any issues at all.  Maybe you are
> referring to a situation where the patch is applied, but the default
> behavior is to still accept the “bad” data?  In that case yes, turning on
> rejection makes it “louder” in that your queries can fail if too many nodes
> are wrong.
>
> 3. Rejecting writes does not capture all places where the problem is
> occurring.  Only logging/metrics fully captures everywhere the problem is
> occurring.
>
>
> Not sure what you are saying here.
>
> nodes can be rejecting writes when they are in fact correct hence causing
> “over-eager unavailability”.
>
>
> When would this occur?  I guess when the node with the bad ring
> information is a replica sent data from a coordinator with the correct ring
> state?  There would be no “unavailability” here unless there were multiple
> nodes in such a state.  I also again would not call this over-eager,
> because the node with the bad ring state is f’ed up and needs to be fixed.
> So it being considered unavailable doesn’t seem over-eager to me.
>
> Given the fact that a user can read NEWS.txt and turn off this rejection
> of writes, I see no reason not to err on the side of “the setting which
> gives better protection even if it is not perfect”.  We should not let the
> want to solve everything prevent incremental improvements, especially when
> we actually do have the solution coming in TCM.
>
> -Jeremiah
>
> On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever <m...@apache.org> wrote:
>
>
> I'm less concerned with what the defaults are in each branch, and more the
> accuracy of what we say, e.g. in NEWS.txt
>
> This is my understanding so far, and where I hoped to be corrected.
>
> 1. Rejecting writes does not prevent data loss in this situation.  It only
> reduces it.  The investigation and remediation of possible mislocated data
> is still required.
>
> 2. Rejecting writes is a louder form of alerting for users unaware of the
> scenario, those not already monitoring logs or metrics.
>
> 3. Rejecting writes does not capture all places where the problem is
> occurring.  Only logging/metrics fully captures everywhere the problem is
> occurring.
>
> 4. This situation can be a consequence of other problems (C* or
> operational), not only range movements and the nature of gossip.
>
>
> (2) is the primary argument I see for setting rejection to default.  We
> need to inform the user that data mislocation can still be happening, and
> the only way to fully capture it is via monitoring of enabled
> logging/metrics.  We can also provide information about when range
> movements can cause this, and that nodes can be rejecting writes when they
> are in fact correct hence causing “over-eager unavailability”.  And
> furthermore, point people to TCM.
>
>
>
> On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
> JD we know it had nothing to do with range movements and could/should have
> been prevented far simpler with operational correctness/checks.
>
> “Be better” is not the answer.  Also I think you are confusing our
> incidents, the out of range token issue we saw was not because of an
> operational “oops” that could have been avoided.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> If some node has a bad ring state it can easily send no writes to the
> correct place, no need for CL ONE, with the current system behavior CL ALL
> will be successful, with all the nodes sent a mutation happily accepting
> and acking data they do not own.
>
> Yes, even with this patch if you are using CL ONE, if the coordinator has
> a faulty ring state where no replica is “real” and it also decides that it
> is one of the replicas, then you will have a successful write, even though
> no correct node got the data.  If you are using CL ONE you already know you
> are taking on a risk.  Not great, but there should be evidence in other
> nodes of the bad thing occurring at the least.  Also for this same ring
> state, for any CL > ONE with the patch the write would fail (assuming only
> a single node has the bad ring state).
>
> Even when the fix is only partial, so really it's more about more
> forcefully alerting the operator to the problem via over-eager
> unavailability …?
>
>
> Not sure why you are calling this “over-eager unavailability”.  If the
> data is going to the wrong nodes then the nodes may as well be down.
> Unless the end user is writing at CL ANY they have requested to be ACKed
> when CL nodes which own the data have acked getting it.
>
> -Jeremiah
>
> On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever <m...@apache.org> wrote:
>
> Great that the discussion explores the issue as well.
>
> So far we've heard three* companies being impacted, and four times in
> total…?  Info is helpful here.
>
> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon i'm
> assuming the company you refer to doesn't overlap. JD we know it had
> nothing to do with range movements and could/should have been prevented far
> simpler with operational correctness/checks.
>
> In the extreme, when no writes have gone to any of the replicas, what
> happened ? Either this was CL.*ONE, or it was an operational failure (not
> C* at fault).  If it's an operational fault, both the coordinator and the
> node can be wrong.  With CL.ONE, just the coordinator can be wrong and the
> problem still exists (and with rejection enabled the operator is now more
> likely to ignore it).
>
> WRT to the remedy, is it not to either run repair (when 1+ replica has
> it), or to load flushed and recompacted sstables (from the period in
> question) to their correct nodes.  This is not difficult, but
> understandably lost-sleep and time-intensive.
>
> Neither of the above two points I feel are that material to the outcome,
> but I think it helps keep the discussion on track and informative.   We
> also know there are many competent operators out there that do detect data
> loss.
>
>
>
> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe <calebrackli...@gmail.com>
> wrote:
>
> If we don’t reject by default, but log by default, my fear is that we’ll
> simply be alerting the operator to something that has already gone very
> wrong that they may not be in any position to ever address.
>
> On Sep 12, 2024, at 12:44 PM, Jordan West <jw...@apache.org> wrote:
>
> 
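> I'm in favor of enabling rejection by default on all branches. We have
> been bitten by silent data loss multiple times because of the lack of
> rejection (due to other bugs, such as the schema issues in 4.1), and it
> isn't recoverable without writing extremely specialized tooling. While
> both lack of availability and data loss are critical, I will always choose
> lack of availability over data loss. It's better to fail a write that
> would be lost than to silently lose it.
>
> Of course, a change like this requires good communication in NEWS.txt and
> elsewhere, but I think it's well worth it. While it may surprise some
> users, I think they would be more surprised to find they were silently
> losing data.
>
> Jordan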
>
> On Thu, Sep 12, 2024 at 10:22, Mick Semb Wever <m...@apache.org> wrote:
>
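> Thanks for starting the thread Caleb, it is a big and impactful patch.
>
> Appreciating the criticality, in a new major release rejection by default
> is obvious. Otherwise, the logging and metrics are an important addition
> to help users validate the existence and degree of any problem.
>
>
> It's also worth mentioning that rejecting writes can cause degraded
> availability even when there is no problem. This is a coordination problem
> on a probabilistic design, and it's pick your evil: unnecessary degraded
> availability or mislocated data (eventual data loss). Logging and metrics
> make it possible to alert on and handle data mislocation, i.e. to avoid
> data loss through manual intervention. (Logging and metrics also face the
> same false-positive problem.)
>
> I'm +0 for rejection default in 5.0.1, and +1 for only-logging default in
> 4.x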
>
>
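> On Thu, Sep 12, 2024 at 18:56, Jeff Jirsa <jji...@gmail.com> wrote:
>
> This patch is so hard for me.
>
> The safety it adds is critical and should have been added a decade ago.
> And it's a huge patch, touching “everything”.
>
> It definitely belongs in 5.0. I'd probably reject by default in 5.0.1.
>
> 4.0 / 4.1 - if we treat it as a fix for a latent opportunity for data loss
> (which it implicitly is), I guess?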
>
>
>
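> > On Sep 12, 2024, at 9:46 AM, Brandon Williams <drift...@gmail.com>
> > wrote:
> >
> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
> > <calebrackli...@gmail.com> wrote:
> >>
> >> Are you opposed to the patch in its entirety, or just rejecting unsafe
> >> operations by default?
> >
> > I had the latter in mind. Changing any default in a patch release is a
> > potential surprise for operators, and one of this nature especially
> > so.
> >
> > Kind Regards,
> > Brandon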
>
>
>
