I'd encourage you to start a new DISCUSS thread around that.

On Fri, Sep 13, 2024 at 2:38 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com>
wrote:

>
> Rejecting/logging the traffic is a significant step forward, but that does
> not solve the real problem. It still degrades the workload and requires
> manual operator involvement.
>
> How about we also enhance Cassandra to automatically detect and fix the
> token ownership mismatch between StorageService and Gossip cache? More
> details are in this ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-18758
>
> Jaydeep
>
> On Thu, Sep 12, 2024 at 9:07 AM Caleb Rackliffe <calebrackli...@gmail.com>
> wrote:
>
>> Until we release TCM, it will continue to be possible for nodes to have a
>> divergent view of the ring, and this means operations can still be sent to
>> the wrong nodes. For example, writes may be sent to nodes that do not and
>> never will own that data, and this opens us up to rather devious silent
>> data loss problems.
>>
>> As some of you may have seen, there is a patch available for 4.0, 4.1,
>> and 5.0 in CASSANDRA-13704
>> <https://issues.apache.org/jira/browse/CASSANDRA-13704> that provides a
>> set of guardrails in the meantime for out-of-range operations. Essentially,
>> there are two new YAML options that control whether or not to log warnings
>> and/or reject operations that shouldn't have arrived at a receiving node.
>>
>> Given that simply logging and recording metrics isn't that invasive, the
>> question we need to answer here is whether we should reject out-of-range
>> operations by default, even in these patch releases. (5.0 has just barely
>> been released, so I'm not sure if that really qualifies, but I digress.)
>> The position I'd like to take is that this is essentially a matter of
>> correctness, and we should *enable rejection by default*. (Keep in mind
>> that both new options are settable at runtime via JMX.) There is precedent
>> for doing something similar to this in CASSANDRA-12126
>> <https://issues.apache.org/jira/browse/CASSANDRA-12126>.
>>
>> The one consequence of this that we might discuss here is that if gossip
>> is slow to notify a node of a pending range, that node's local rejection
>> of writes for that range may cause a brief loss of availability.
>> However, this shouldn't happen in a healthy cluster, and even if it does,
>> we're simply translating a silent potential data loss bug into a transient
>> but necessary availability gap with reasonable logging and visibility.
>>
>
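
For readers skimming the thread, here is a rough sketch of the kind of
guardrail described in the quoted message: a receiving node checks whether
an operation's token falls in a range it owns (or has pending) and then logs
and/or rejects. This is not the CASSANDRA-13704 patch. Every name below
(TokenRange, check_operation, log_out_of_range, reject_out_of_range) is
hypothetical and is not one of the real YAML option names. The (left, right]
wrap-around convention is assumed for illustration.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("out_of_range_sketch")


@dataclass(frozen=True)
class TokenRange:
    """Hypothetical ring range: left is exclusive, right is inclusive."""
    left: int
    right: int

    def contains(self, token: int) -> bool:
        if self.left < self.right:
            return self.left < token <= self.right
        # left >= right: the range wraps around the end of the ring
        return token > self.left or token <= self.right


class OutOfRangeError(Exception):
    """Raised when rejection is enabled and the token is not owned locally."""


def check_operation(token, owned_ranges, pending_ranges,
                    log_out_of_range=True, reject_out_of_range=True):
    """Return True if this node owns (or is pending for) the token.

    The two keyword flags stand in for the two settings mentioned in the
    thread; their names here are illustrative assumptions only.
    """
    ranges = list(owned_ranges) + list(pending_ranges)
    if any(r.contains(token) for r in ranges):
        return True
    if log_out_of_range:
        log.warning("operation for token %d is outside local ranges", token)
    if reject_out_of_range:
        raise OutOfRangeError(f"token {token} is not owned by this node")
    return False
```

With rejection off, the function only logs and returns False, so the caller
can still apply the operation. With rejection on, the coordinator would see
a failure instead of a silently misplaced write. The pending_ranges argument
is where the gossip-lag concern shows up: if a node has not yet learned
about a pending range, it would reject writes it should have accepted.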
