It’s worth noting, though, that a very large engineering effort called “Transactional Cluster Metadata” (TCM) is already wrapping up and properly addresses these problems. It will land in 5.1, however, and won’t be suitable for back-porting.


On 13 Sep 2024, at 21:32, Caleb Rackliffe <calebrackli...@gmail.com> wrote:


I'd encourage you to start a new DISCUSS thread around that.

On Fri, Sep 13, 2024 at 2:38 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

Rejecting or logging the traffic is a significant step forward, but it does not solve the real problem: it still degrades the workload and requires manual operator involvement.

How about we also enhance Cassandra to automatically detect and fix the token ownership mismatch between StorageService and the Gossip cache? More details are in this ticket: https://issues.apache.org/jira/browse/CASSANDRA-18758

Jaydeep

On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
Until we release TCM, it will continue to be possible for nodes to have a divergent view of the ring, and this means operations can still be sent to the wrong nodes. For example, writes may be sent to nodes that do not and never will own that data, and this opens us up to rather devious silent data loss problems.

As some of you may have seen, there is a patch available for 4.0, 4.1, and 5.0 in CASSANDRA-13704 that provides a set of guardrails in the meantime for out-of-range operations. Essentially, there are two new YAML options that control whether or not to log warnings and/or reject operations that shouldn't have arrived at a receiving node.
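
To make the shape of those two options concrete, here is a minimal Python sketch of a receiving-side check. It is not the code from the patch: the class, the flag names (log_out_of_range, reject_out_of_range), the counter, and the integer token ranges are all assumptions made up for illustration.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("out_of_range_guardrail")


class OutOfRangeError(Exception):
    """Raised when an operation is rejected for a token this node does not own."""


def in_range(token, left, right):
    """True if token falls in the ring range (left, right], allowing wraparound."""
    if left < right:
        return left < token <= right
    # Wrapping range (or left == right, which here covers the whole ring).
    return token > left or token <= right


@dataclass
class OutOfRangeGuardrail:
    # Hypothetical flag names, standing in for the two new options.
    log_out_of_range: bool = True
    reject_out_of_range: bool = True
    out_of_range_count: int = 0  # stand-in for a metric

    def check(self, token, owned_ranges):
        """Return True if the token is owned; otherwise count, log and/or reject."""
        if any(in_range(token, left, right) for left, right in owned_ranges):
            return True
        self.out_of_range_count += 1
        if self.log_out_of_range:
            logger.warning("operation for token %s is outside owned ranges %s",
                           token, owned_ranges)
        if self.reject_out_of_range:
            raise OutOfRangeError(f"token {token} is not owned by this node")
        return False


# Log-only mode: the operation is counted and logged but still allowed through.
guard = OutOfRangeGuardrail(log_out_of_range=True, reject_out_of_range=False)
guard.check(15, [(0, 10)])
```

With reject_out_of_range=True the same call raises OutOfRangeError instead of returning False, which is the behaviour whose default is being discussed here.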

Given that simply logging and recording metrics isn't that invasive, the question we need to answer here is whether we should reject out-of-range operations by default, even in these patch releases. (5.0 has just barely been released, so I'm not sure if that really qualifies, but I digress.) The position I'd like to take is that this is essentially a matter of correctness, and we should enable rejection by default. (Keep in mind that both new options are settable at runtime via JMX.) There is precedent for doing something similar to this in CASSANDRA-12126.

The one consequence of this that we might discuss here is that if gossip is behind in notifying a node of a pending range, the node will locally reject writes it receives for that range, which may cause a small availability gap. However, this shouldn't happen in a healthy cluster, and even if it does, we're simply translating a silent potential data loss bug into a transient but necessary availability gap with reasonable logging and visibility.
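
The pending-range case can be sketched as follows. This is a toy model, not Cassandra's implementation: accepts_write, the integer tokens, and the range lists are hypothetical, and it assumes a write is accepted when its token falls in either an owned or a known pending range.

```python
def in_range(token, left, right):
    """True if token falls in the ring range (left, right], allowing wraparound."""
    if left < right:
        return left < token <= right
    return token > left or token <= right


def accepts_write(token, owned, pending):
    """Accept a write only if the token is in an owned or a known pending range."""
    return any(in_range(token, left, right)
               for left, right in list(owned) + list(pending))


owned = [(0, 100)]
pending = []  # gossip has not yet told this node about its pending range

# The coordinator already routes token 150 here, but the node rejects it.
before = accepts_write(150, owned, pending)

# Once gossip catches up, the node learns the pending range and accepts the write.
pending.append((100, 200))
after = accepts_write(150, owned, pending)

print(before, after)  # False True
```

The window between the two calls is the transient availability gap: the write fails visibly rather than being applied silently on a node with a stale view.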
