I almost forgot CASSANDRA-15817, which introduced reject_repair_compaction_threshold, a mechanism to stop repairs while compaction is underwater.
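For anyone who hasn't looked at that option, the shape of it is roughly the following. This is just a minimal sketch with made-up names (RepairAdmission, pendingCompactions, etc.), not the code from the actual patch:

final class RepairAdmission
{
    // A node refuses new repair work while its compaction backlog exceeds a
    // configured threshold; a negative value disables the check entirely.
    private final int rejectRepairCompactionThreshold;
    private final java.util.function.IntSupplier pendingCompactions;

    RepairAdmission(int threshold, java.util.function.IntSupplier pendingCompactions)
    {
        this.rejectRepairCompactionThreshold = threshold;
        this.pendingCompactions = pendingCompactions;
    }

    // Returns true if an incoming repair request should be rejected right now.
    boolean shouldReject()
    {
        if (rejectRepairCompactionThreshold < 0)
            return false; // feature disabled
        return pendingCompactions.getAsInt() > rejectRepairCompactionThreshold;
    }
}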

On Jan 26, 2024, at 6:22 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:


Hey all,

I'm a bit late to the discussion. I see that we've already discussed CASSANDRA-15013 and CASSANDRA-16663 at least in passing. Having written the latter, I'd be the first to admit it's a crude tool, although it's been useful here and there, and it provides a couple of primitives that may be useful for future work. As Scott mentions, while it is configurable at runtime, it is not adaptive, although we did make configuration easier in CASSANDRA-17423. It is also global to the node, although we've lightly discussed some ideas around making it more granular. (For example, keyspace-based limiting, or limiting "domains" tagged by the client in requests, could be interesting.) It also does not deal with internode traffic, of course.
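To make the "crude tool" comment concrete: at its core, a node-global, runtime-adjustable limiter doesn't need to be much more than the sketch below. (Hypothetical names and a deliberately naive fixed-window counter, not the actual CASSANDRA-16663 implementation.) The granularity ideas above would mostly amount to keeping one of these per keyspace or per client-tagged "domain" rather than one per node:

import java.util.concurrent.atomic.AtomicLong;

final class NodeGlobalRequestLimiter
{
    private volatile long maxRequestsPerSecond;               // adjustable at runtime, e.g. via JMX
    private final AtomicLong windowStartNanos = new AtomicLong(System.nanoTime());
    private final AtomicLong grantedInWindow = new AtomicLong();

    NodeGlobalRequestLimiter(long maxRequestsPerSecond)
    {
        this.maxRequestsPerSecond = maxRequestsPerSecond;
    }

    void setMaxRequestsPerSecond(long newLimit)
    {
        this.maxRequestsPerSecond = newLimit;
    }

    // Returns true if the request may proceed, false if it should be shed or
    // backpressured. The window reset is approximate, which is fine for a
    // coarse protective limit.
    boolean tryAcquire()
    {
        long now = System.nanoTime();
        long windowStart = windowStartNanos.get();
        if (now - windowStart >= 1_000_000_000L && windowStartNanos.compareAndSet(windowStart, now))
            grantedInWindow.set(0); // start a new one-second window
        return grantedInWindow.incrementAndGet() <= maxRequestsPerSecond;
    }
}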

Something we've not yet mentioned (that does address internode traffic) is CASSANDRA-17324, which I proposed shortly after working on the native request limiter (and have just not had much time to return to). The basic idea is this:

When a node is struggling under the weight of a compaction backlog and becomes a cause of increased read latency for clients, we have two safety valves:

1.) Disabling the native protocol server, which stops the node from coordinating reads and writes.
2.) Jacking up the severity on the node, which tells the dynamic snitch to avoid the node for reads from other coordinators.

These are useful, but we don’t appear to have any mechanism that would allow us to temporarily reject internode hint, batch, and mutation messages that could further delay resolution of the compaction backlog.


Whether it's done as part of a larger framework or on its own, it still feels like a good idea.
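In case it helps to picture it, the rough shape I have in mind is a verb-level valve on the inbound internode path, something like the sketch below. (Made-up names, not the actual proposal or a patch.) While the local compaction backlog is above a threshold, the node temporarily refuses the write-shaped messages that would make the backlog worse and keeps serving everything else:

import java.util.EnumSet;
import java.util.Set;

final class InboundWriteValve
{
    enum Verb { MUTATION, BATCH, HINT, READ, OTHER }

    // Only the messages that add write work are deferrable; reads and everything
    // else pass through this valve untouched.
    private static final Set<Verb> DEFERRABLE = EnumSet.of(Verb.MUTATION, Verb.BATCH, Verb.HINT);

    private final int pendingCompactionThreshold;
    private final java.util.function.IntSupplier pendingCompactions;

    InboundWriteValve(int threshold, java.util.function.IntSupplier pendingCompactions)
    {
        this.pendingCompactionThreshold = threshold;
        this.pendingCompactions = pendingCompactions;
    }

    // Returns true if this inbound message should be rejected so the sender can
    // hint it or retry against another replica.
    boolean shouldReject(Verb verb)
    {
        return DEFERRABLE.contains(verb)
               && pendingCompactions.getAsInt() > pendingCompactionThreshold;
    }
}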

Thinking in terms of opportunity costs here (i.e. where we spend our finite engineering time to holistically improve the experience of operating this database) is healthy, but we probably haven't reached the point of diminishing returns on nodes being able to protect themselves from clients and from other nodes. I would just keep in mind two things:

1.) The effectiveness of rate-limiting in the system (which includes the database and all clients) as a whole necessarily decreases as we move from the application to the lowest-level database internals. Limiting correctly at the client will save more resources than limiting at the native protocol server, and limiting correctly at the native protocol server will save more resources than limiting after we've dispatched requests to some thread pool for processing. (There's a quick sketch of the client-side case after this list.)
2.) We should make sure the links between the "known" root causes of cascading failures and the mechanisms we introduce to avoid them remain very strong.
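On the first point, here's a trivial sketch of what "limiting correctly at the client" can look like: just an in-flight cap around whatever async call the application makes, with made-up names and no dependence on any particular driver API. A request that is shed here never costs the native protocol server, the coordinator, or the replicas anything:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class ClientSideLimiter
{
    private final Semaphore inFlight;

    ClientSideLimiter(int maxInFlight)
    {
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Wraps an async request. If the client already has maxInFlight requests
    // outstanding, the new one is rejected locally instead of being sent.
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> request)
    {
        if (!inFlight.tryAcquire())
        {
            CompletableFuture<T> rejected = new CompletableFuture<>();
            rejected.completeExceptionally(new IllegalStateException("client overloaded, request shed locally"));
            return rejected;
        }
        return request.get().whenComplete((result, error) -> inFlight.release());
    }
}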

In any case, I'd be happy to help out in any way I can as this moves forward (especially as it relates to our past/current attempts to address this problem space).
