[
https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536846#comment-14536846
]
Jonathan Shook commented on CASSANDRA-9318:
-------------------------------------------
I would venture that a solid load shedding system may improve the degenerate
overloading case, but it is not the preferred method for dealing with
overloading for most users. The concept of back-pressure is more squarely what
people expect, for better or worse.
Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to
handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly
it accepts new work. This means that it simply blocks the operations or
submissions of new requests with some useful bound that is determined by the
system. It does not yet have to shed load. It does not yet have to give
exceptions. This is a very reasonable expectation for most users. This is what
they expect. Load shedding is a term of art which does not change the users
expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does
not have to shed load yet. This is a very reasonable expectation for users who
are savvy enough to do active load management at the client level. It may have
to start writing hints, but if you are writing hints because of load, this
might not be the best justification for having the hints system kick in. To me
this is inherently a convenient remedy for the wrong problem, even if it works
well. Yes, hints are there as a general mechanism, but it does not relieve us
of the problem of needing to know when the system is at capacity and how to
handle it proactively. You could also say that hints actively hurt capacity
when you need them most sometimes. They are expensive to process given the
current implementation, and will always be "load shifting" even at theoretical
best. Still we need them for node availability concerns, although we should be
careful to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (without a helpful
signature of such in the responses, maybe BackloggingException with some queue
estimate). This is a very reasonable expectation for users who are savvy enough
to manage their peak and valley workloads in a sensible way. Sometimes you
actually want to tax the ingest and flush side of the system for a bit before
allowing it to switch modes and catch up with compaction. The fact that C* can
do this is an interesting capability, but those who want backpressure will not
easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed
load. This should only happen if the users has decided that they want to be
responsible for such and have pushed the system beyond the reasonable limit
without paying attention to the indications in 2a, 2b, and 2c.
Order of precedence, designated mode of operation, or any other concerns aren't
really addressed here. I just provided them as examples of types of behaviors
which are nuanced yet perfectly valid for different types of system designers.
The real point here is that there is not a single overall design which is going
to be acceptable to all users. Still, we need to ensure stability under
saturating load where possible. I would like to think that with CASSANDRA-8099
that we can start discussing some of the client-facing back-pressure ideas more
earnestly.
We can come up with methods to improve the reliable and responsive capacity of
the system even with some internal load management. If the first cut ends up
being sub-optimal, then we can measure it against non-bounded workload tests
and strive to close the gap. If it is implemented in a way that can support
multiple usage scenarios, as described above, then such a limitation might be
"unlimited", "bounded at level ___", or "bounded by inline resource
management".. But in any case would be controllable by some users/admin,
client.. If we could ultimately give the categories of users above the ability
to enable the various modes, then the 2a) scenario would be perfectly desirable
for many users already even if the back-pressure logic only gave you 70% of the
effective system capacity. Once testing shows that performance with active
back-pressure to the client is close enough to the unbounded workloads, it
could be enabled.
Summary: We still need reasonable back-pressure support throughout the system
and eventually to the client. Features like this that can be a stepping stone
towards such are still needed. The most perfect load shedding and hinting
systems will still not be a sufficient replacement for back-pressure and
capacity management.
> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
> Key: CASSANDRA-9318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 2.1.x
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding
> bytes and requests and if it reaches a high watermark disable read on client
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't
> introduce other issues.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)