[ 
https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200936#comment-13200936
 ] 

Peter Schuller commented on CASSANDRA-3852:
-------------------------------------------

I feel exactly the opposite in almost all practical use-cases I have ever 
encountered in production. The impact of a service (such as Cassandra) becoming 
a tarpit to all requests tends to spill over onto other things unless every 
component in the stack is perfectly written to be immune to, e.g., concurrency 
spikes; having a select few requests be slower is typically a much less harmful 
behavior.

More specifically, failure modes involving a lack of forward progress are a 
classic problem for systems with multiple components. One classic way to get 
into this position is to have timeouts incorrectly set such that the backend 
system is doing work whose results are thrown away because a system in front of 
it had a lower timeout. This is why the default should be to have the rpc 
timeout in Cassandra higher than the client timeout. In practice, any system 
whose latencies and timeouts are not perfectly configured will be much better 
served by falling back to LIFO.
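
For reference, here is where the two timeouts typically live; this uses the 
raw Thrift client purely for illustration (higher-level client libraries 
expose equivalent settings, and all the concrete values below are just 
examples):

{code}
import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class TimeoutExample
{
    public static void main(String[] args) throws Exception
    {
        // Client-side timeout: how long the caller waits on the socket before
        // giving up (2000 ms here is an arbitrary illustrative value).
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160, 2000));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

        // Server-side: rpc_timeout_in_ms in cassandra.yaml (10000 ms by default)
        // bounds how long the co-ordinator waits on replicas. If the client-side
        // value above is the lower of the two, the cluster keeps doing work for
        // callers that have already given up.
        System.out.println(client.describe_version());
        transport.close();
    }
}
{code}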

Further, I strongly disagree that optimizing for the non-overload case is the 
right approach. In most cases, averages are given far too much credence. It's 
completely irrelevant to most serious production use-cases whether your average 
is 5 ms instead of 6 ms; what tends to be much more important is that you don't 
explode/fall over as soon as there is a hiccup somewhere. This is a similar 
observation to the discussion in CASSANDRA-2540. Slight differences in average 
(or median) latency are not what bring down a production system; sudden 
explosions and things falling over are.

Also, consider that an overload condition is not a once-in-a-year event where 
someone forgot to add capacity. This behavior matters whenever:

* A node goes into a GC pause.
* Node(s) are restarted and caches are cold.
* Node(s) are backed up by streaming/compaction (easily a problem if practical 
concerns caused a late capacity add, especially as long as CASSANDRA-3569 is in 
effect).
* The operating system decides to flush dirty buffers to disk and for some 
reason (streaming, compaction) there's a non-trivial impact (say, 500 ms) on 
I/O.
* Probably a slew of cases I'm not thinking of :)

Real production systems aren't usefully judged by overall latencies over time 
without looking at outliers (not just pXX latencies, but outliers in behavior 
as well); what matters is whether the system works reliably and consistently 
over a long period, rather than having hiccups that significantly affect other 
services whenever there is some issue with a node. At least that has been my 
experience so far, without exception, with systems running at high 
throughput/capacity (rather than 90% idle) where high availability is a 
concern.

I think that if someone truly has a situation where the surrounding systems are 
perfect and they genuinely care about a millisecond here and there on averages, 
or care more about absolutely minimizing outliers under non-overload 
conditions, they can easily turn the feature I'm suggesting here off (or not 
turn it on, if it's not the default). Such users should also be worrying about 
enabling data reads by default, submitting requests to multiple co-ordinators 
to (probabilistically) avoid co-ordinator GC pauses, replacing the dynamic 
snitch with something that reacts faster (even at CPU expense), etc. I am not 
saying these people don't exist (we have potential use-cases here with very 
strict latency requirements which will probably require measures like these to 
be feasible), but these situations are not really serviced right now *anyway*, 
and they are sufficiently special that it's okay if the default behavior does 
not optimize for this case.

The normal case is clients that are significantly imperfect systems: often web 
apps or other clients that have concurrency limits, aren't perfect asynchronous 
event loops with tight control over e.g. memory use, and are thus susceptible 
to the negative effects of concurrency spikes. That case is, I think, much 
better served by what I'm suggesting.
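
To make the concurrency-spike point concrete, here is a back-of-the-envelope 
Little's law calculation (all numbers are made up purely for illustration):

{code}
// Required concurrency in the client = request rate * per-request latency
// (Little's law). Numbers are hypothetical, for illustration only.
public class ConcurrencySpike
{
    public static void main(String[] args)
    {
        double requestsPerSecond = 1000.0;  // client's incoming load
        double healthyLatency = 0.005;      // 5 ms when the node is healthy
        double tarpitLatency = 2.0;         // every request slow under FIFO overload

        // ~5 workers busy in the healthy case...
        System.out.printf("healthy: ~%.0f concurrent requests%n",
                          requestsPerSecond * healthyLatency);
        // ...versus ~2000 in the tarpit case, far beyond a typical fixed pool.
        System.out.printf("tarpit:  ~%.0f concurrent requests%n",
                          requestsPerSecond * tarpitLatency);
    }
}
{code}

A client with a fixed pool of a couple hundred workers is fine in the first 
case and completely wedged in the second, which is exactly the kind of 
knock-on effect I'm talking about.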

In addition, even assuming perfectly working systems, consider the user visible 
impact of a single node being temporarily overloaded. Instead of every single 
user whose request hits this node experiencing degraded latency, only a tiny 
subset potentially would.

I would argue that a better way to avoid "normal steady state" outliers is to 
do things like the CASSANDRA-2540 data reads by default, ensuring the dynamic 
snitch is real-time and looks at currently outstanding requests, etc.: 
probabilistic improvements that attempt to mitigate the effects of individual 
variance in the system, whether due to LIFO, GC pauses or other things. It 
feels strange to retain strict FIFO in order to eliminate outliers while we 
still have a failure detector/dynamic snitch combination which utterly fails to 
handle the most trivial case you can imagine (again, "the TCP connection dies 
with a reset and I'm going to keep blindly queuing up hundreds of thousands of 
outstanding requests to this node while other nodes are nowhere near capacity 
and have < 10 outstanding requests"). I don't understand how the latter can be 
acceptable if we are so worried about outliers that moving away from strict 
FIFO is not.

In summary: while I certainly do not claim that the proposed LIFO mechanism is 
perfect (far from it), I think the disadvantages it incurs are tiny compared to 
some of the other behaviors that need to be fixed, while the advantages it 
would bring in common production scenarios are very significant.

In any case, given that at least a draft prototype is not very difficult to put 
together, I'll definitely go ahead and do something with this (not sure when). 
We might come back with actual numbers from production clusters (not a promise; 
it depends on priorities and other things).
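
To make the idea concrete, here is a rough sketch of the sort of queue I have 
in mind. This is purely illustrative (class and parameter names are made up, it 
assumes a single consumer thread, and it ignores the real stage/executor 
plumbing); it is not the prototype itself:

{code}
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Sketch only: FIFO in the normal case, but serve the newest entry first once
// the oldest queued entry exceeds an age threshold or the queue exceeds a size
// threshold. Assumes a single consumer thread for simplicity.
public class FifoWithLifoFallback<T>
{
    private static final class Entry<T>
    {
        final T task;
        final long enqueuedAt = System.nanoTime();
        Entry(T task) { this.task = task; }
    }

    private final LinkedBlockingDeque<Entry<T>> deque = new LinkedBlockingDeque<Entry<T>>();
    private final long maxHeadAgeNanos;
    private final int maxFifoSize;

    public FifoWithLifoFallback(long maxHeadAgeMillis, int maxFifoSize)
    {
        this.maxHeadAgeNanos = TimeUnit.MILLISECONDS.toNanos(maxHeadAgeMillis);
        this.maxFifoSize = maxFifoSize;
    }

    public void add(T task)
    {
        deque.addLast(new Entry<T>(task));
    }

    public T take() throws InterruptedException
    {
        Entry<T> oldest = deque.takeFirst(); // blocks until something is queued
        long headAge = System.nanoTime() - oldest.enqueuedAt;
        if (headAge <= maxHeadAgeNanos && deque.size() < maxFifoSize)
            return oldest.task; // normal case: strict FIFO

        // Over the threshold: put the oldest entry back and serve the newest
        // one instead, so requests whose clients have not yet given up keep
        // making forward progress.
        deque.addFirst(oldest);
        return deque.takeLast().task;
    }
}
{code}

A real version would also need to handle the prioritization/bypass of 
"internal" messages mentioned in the description, and probably tie the 
thresholds to the rpc timeout rather than hard-coding them.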


                
> use LIFO queueing policy when queue size exceeds thresholds
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-3852
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency 
> and forward progress. Whenever a node is saturated by the incoming request 
> rate, *all* requests become slow. If it is consistently saturated, you 
> effectively start timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset 
> of requests quickly and let some time out, rather than letting all of them 
> either time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed in a reasonably timely 
> manner (we couldn't go strictly LIFO, for example, as that would result in 
> requests getting stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, ensure that some 
> requests bypass the policy and get prioritized (e.g., schema migrations, or 
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the 
> queue that are older than N milliseconds (or a certain queue size, etc).
> Benefits:
> * In all cases where the client either is, or indirectly affects through 
> other layers, a system with limited concurrency (e.g., a thread pool of size 
> X serving some incoming request rate), it is *much* better for a few 
> requests to time out while most are serviced quickly than for all requests 
> to become slow, since the former doesn't explode concurrency. Think any 
> random non-super-advanced PHP app, Ruby web app, Java servlet based app, etc. 
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly 
> benefit from such a policy. For example, suppose you have a system at 85% of 
> capacity, and it takes a write spike (or has a hiccup like a GC pause, 
> blocking on a commit log write, etc). Suppose the hiccup racks up 500 ms 
> worth of requests. At a 15% margin at steady state, that takes 500 ms * 
> 100/15 ≈ 3.3 seconds to recover. Instead of *all* requests for an entire 
> 3.3 second window being slow, we'd serve requests quickly for about 2.8 of 
> those seconds, with the incoming requests during that 500 ms interval being 
> the ones primarily affected. The flip side, though, is that once you're at 
> the point where more than N percent of requests end up having to wait for 
> others to take LIFO priority, the p(100-N) latencies will actually be 
> *worse* than without this change (but at that point you have to consider 
> what the root reason for those pXX requirements is).
> * In the case of complete saturation, it allows forward progress. Suppose 
> you're taking 25% more traffic than you are able to handle. Instead of 
> getting backed up and ending up essentially timing out *every single 
> request*, you will succeed in processing up to 75% of them (I say "up to" 
> because it depends; for example on a {{QUORUM}} request you need at least two 
> of the requests from the co-ordinator to succeed so the percentage is brought 
> down) and allowing clients to make forward progress and get work done, rather 
> than being stuck.
