[ https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200936#comment-13200936 ]
Peter Schuller commented on CASSANDRA-3852:
-------------------------------------------

I feel exactly the opposite in almost all practical use-cases I have ever encountered in production. The impact of a service (such as Cassandra) becoming a tarpit to all requests tends to cause indirect effects on other things, unless every component in the stack is perfectly written to be independent of, e.g., concurrency spikes; having a select few requests be slightly slower is typically a much less harmful behavior.

More specifically, failure modes involving a lack of forward progress are a classic problem affecting systems with multiple components. One classic way to get into this position is to have timeouts incorrectly set such that the backend system is doing work whose results are thrown away because a system in front of it had a lower timeout. This is why the default should be to have the rpc timeout in Cassandra lower than the client timeout. In this case, any system not perfectly built to have latencies and timeouts correctly configured will be much better served by falling back to LIFO.

Further, I strongly disagree that optimizing for the non-overload case is the right approach. In most cases, averages are given much too much credence. It's completely irrelevant to most serious production use-cases whether your average is 5 ms instead of 6 ms; what tends to be much more important is that you don't explode/fall over as soon as there is a hiccup somewhere. This is a similar observation to the discussion in CASSANDRA-2540. Slight differences in the average (or median) latency are not what brings down a production system; sudden explosions and things falling over are.

Also, consider that an overload condition is not a once-in-a-year case where someone forgot to add capacity. This behavior characteristic applies whenever:

* A node goes into a GC pause.
* Node(s) are restarted and caches are cold.
* Node(s) are backed up by streaming/compaction (easily a problem if practical concerns caused a late capacity add, especially as long as CASSANDRA-3569 is in effect).
* The operating system decides to flush dirty buffers to disk and, for some reason (streaming, compaction), there is a non-trivial impact (say, 500 ms) on I/O.
* Probably a slew of cases I'm not thinking of :)

Real production systems aren't very usefully judged by overall latencies over time without looking at outliers (not just in terms of pXX latencies, but outliers in behavior, etc.); what matters is whether the system works reliably and well, consistently, across a longer period of time, rather than having hiccups that significantly affect other services whenever there is some issue with a node. At least this has been all my experience so far, ever, with systems running at high throughput/capacity (rather than 90% idle) where high availability is a concern.

I think that if someone truly has a situation where surrounding systems are perfect, and they really truly care about a millisecond here and there on averages, or care more about absolutely minimizing outliers in non-overload conditions, they can easily turn the feature I'm suggesting here off (or not turn it on, if it's not the default). Such users should also be looking at enabling data reads by default, submitting requests to multiple co-ordinators to avoid co-ordinator GC pauses (probabilistically), replacing the dynamic snitch with something that reacts faster (even at CPU expense), etc. I am not saying these people don't exist (we have potential use-cases here with very strict latency requirements which will probably take measures like these to make feasible), but these situations are not really serviced right now *anyway*, and they are sufficiently special that it's okay if the default behavior does not optimize for this case.
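To make the earlier timeout-inversion point concrete, here is a small sketch; the class, constants, and method are hypothetical illustrations, not actual Cassandra configuration or APIs. When the backend's timeout exceeds the caller's, every result completed between the two limits is work whose output is simply discarded:

```java
// Illustrative sketch of the timeout-inversion failure mode; all names and
// constants here are hypothetical, not Cassandra's actual settings.
public class TimeoutInversion {
    // Hypothetical: the caller gives up after 1 s ...
    static final long CLIENT_TIMEOUT_MS = 1_000;
    // ... while the backend keeps working for up to 10 s.
    static final long BACKEND_RPC_TIMEOUT_MS = 10_000;

    // A backend result is only useful if it arrives while the caller is
    // still waiting. Anything completed after CLIENT_TIMEOUT_MS but before
    // BACKEND_RPC_TIMEOUT_MS is wasted work.
    static boolean resultUsable(long processingMillis) {
        boolean backendCompletes = processingMillis <= BACKEND_RPC_TIMEOUT_MS;
        boolean callerStillWaiting = processingMillis <= CLIENT_TIMEOUT_MS;
        return backendCompletes && callerStillWaiting;
    }

    public static void main(String[] args) {
        System.out.println(resultUsable(500));   // fast enough for both limits
        System.out.println(resultUsable(5_000)); // backend "succeeds", caller is gone
    }
}
```

Under overload, the head of a FIFO queue is full of exactly these already-abandoned requests, which is why serving newest-first can be the better default.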
The normal case, in which clients are significantly imperfect systems (often web apps, or clients with concurrency limits that are not perfect asynchronous event loops with tight control over, e.g., memory use, and that are thus susceptible to the negative effects of concurrency spikes), is, I think, much better served by what I'm suggesting.

In addition, even assuming perfectly working systems, consider the user-visible impact of a single node being temporarily overloaded. Instead of every single user whose request hits this node experiencing elevated latency, only a tiny subset potentially would. I would argue that a better way to avoid "normal steady state" outliers is to do things like the CASSANDRA-2540 data reads by default, ensuring the dynamic snitch is real-time and looks at current outstanding requests, etc.: probabilistic improvements that attempt to mitigate the effects of individual variances in the system, whether they are due to LIFO, GC pauses, or other things.

It feels strange to retain strict FIFO to eliminate outliers while we still have a failure detector/dynamic snitch combination which utterly fails to handle the most trivial case you can imagine (again, the "tcp connection dies with a reset and I'm gonna keep queuing up hundreds of thousands of outstanding requests to this node blindly while other nodes are not even close to capacity and have < 10 outstanding requests" case). I don't understand how the latter can be acceptable if we are so worried about outliers that moving away from strict FIFO is not.

In summary: while I certainly do not claim that the proposed LIFO mechanism is perfect (far from it), I think the disadvantages it incurs are tiny compared to some of the other behaviors that need to be fixed, while the positive advantages in common production scenarios would be very significant.
In any case, given that at least a draft prototype is not very difficult to get done, I'll definitely go ahead and do something (not sure when) with this. We might need to come back with actual numbers from production clusters (not a promise; depends on priorities and other things).

> use LIFO queueing policy when queue size exceeds thresholds
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-3852
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3852
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> A strict FIFO policy for queueing (between stages) is detrimental to latency
> and forward progress. Whenever a node is saturated beyond the incoming request
> rate, *all* requests become slow. If it is consistently saturated, you
> effectively start timing out on *all* requests.
> A much better strategy from the point of view of latency is to serve a subset
> of requests quickly, and let some time out, rather than letting all either
> time out or be slow.
> Care must be taken such that:
> * We still guarantee that requests are processed reasonably promptly (we
> couldn't go strict LIFO, for example, as that would result in requests getting
> stuck potentially forever on a loaded node).
> * Maybe, depending on the previous point's solution, some requests should
> bypass the policy and get prioritized (e.g., schema migrations, or
> anything "internal" to a node).
> A possible implementation is to go LIFO whenever there are requests in the
> queue that are older than N milliseconds (or a certain queue size, etc.).
> Benefits:
> * In all cases where the client directly, or indirectly through other
> layers, affects a system which has limited concurrency (e.g., a thread pool
> size of X to serve some incoming request rate), it is *much* better for a few
> requests to time out while most are serviced quickly than for all requests
> to become slow, as it doesn't explode concurrency.
> Think any random
> non-super-advanced php app, ruby web app, java servlet based app, etc.
> Essentially, it optimizes very heavily for improved average latencies.
> * Systems with strict p95/p99/p999 requirements on latencies should greatly
> benefit from such a policy. For example, suppose you have a system at 85% of
> capacity, and it takes a write spike (or has a hiccup like a GC pause, blocking
> on a commit log write, etc.). Suppose the hiccup racks up 500 ms worth of
> requests. With a 15% margin at steady state, that takes 500 ms * 100/15 ≈ 3.3
> seconds to recover from. Instead of *all* requests for an entire 3.3 second
> window being slow, we'd serve requests quickly for 2.8 of those seconds, with
> the incoming requests during that 500 ms interval being the ones primarily
> affected. The flip side, though, is that once you're at the point where more
> than N percent of requests end up having to wait for others that take LIFO
> priority, the p(100-N) latencies will actually be *worse* than without this
> change (but at that point you have to consider what the root reason for those
> pXX requirements is).
> * In the case of complete saturation, it allows forward progress. Suppose
> you're taking 25% more traffic than you are able to handle. Instead of
> getting backed up and ending up essentially timing out *every single
> request*, you will succeed in processing up to 75% of them (I say "up to"
> because it depends; for example, on a {{QUORUM}} request you need at least two
> of the requests from the co-ordinator to succeed, so the percentage is brought
> down), allowing clients to make forward progress and get work done, rather
> than being stuck.
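The possible implementation mentioned in the description (go LIFO whenever queued requests exceed a certain age) can be sketched roughly as follows. This is a minimal illustration under assumed names, not Cassandra's actual stage/queue classes: dequeue FIFO normally, but switch to serving the newest request once the oldest one has waited at least a configured threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the adaptive-LIFO policy; names are illustrative,
// not Cassandra's actual stage queues.
public class AdaptiveLifoQueue<T> {
    private static final class Entry<T> {
        final T item;
        final long enqueuedAtMillis;
        Entry(T item, long enqueuedAtMillis) {
            this.item = item;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final Deque<Entry<T>> deque = new ArrayDeque<>();
    private final long lifoThresholdMillis;

    // A threshold of 0 means "always LIFO"; a very large threshold is
    // effectively plain FIFO.
    public AdaptiveLifoQueue(long lifoThresholdMillis) {
        this.lifoThresholdMillis = lifoThresholdMillis;
    }

    public synchronized void add(T item) {
        deque.addLast(new Entry<>(item, System.currentTimeMillis()));
    }

    // Returns null when empty. When the queue is backed up (the head has
    // waited at least the threshold), serve the *newest* request so that
    // some requests stay fast; the old head has likely blown its
    // client-side timeout anyway.
    public synchronized T poll() {
        Entry<T> head = deque.peekFirst();
        if (head == null)
            return null;
        long headAgeMillis = System.currentTimeMillis() - head.enqueuedAtMillis;
        Entry<T> chosen = (headAgeMillis >= lifoThresholdMillis)
                ? deque.pollLast()    // backed up: LIFO
                : deque.pollFirst();  // normal: FIFO
        return chosen.item;
    }

    public synchronized int size() {
        return deque.size();
    }
}
```

Per the "care must be taken" points above, a real implementation would also need an upper bound on how long a request can be starved at the head, plus a bypass for internal/priority messages; neither is modeled in this sketch.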