Rate limiting is a good idea. It requires a lot of ongoing engineering to adjust the rates to the current cluster behavior. It doesn’t help with some kinds of overload. The ROI just doesn’t work out. It is too much work for not enough benefit.
Rate limiting works if the collection size doesn’t change and the queries don’t change. At Netflix, we limited traffic based on number of connections to each server. This is basically the length of the queue of requests for that server. This is similar to limiting by load average, which is also the work waiting to be done. It has the same weaknesses as the load average circuit breaker, but it did not need to be changed when average CPU usage per query increased. It was “set and forget”. Rate limiters require constant adjustment.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 14, 2021, at 11:44 AM, Atri Sharma <a...@apache.org> wrote:
>
> This is a debate better suited for a different forum -- but I would
> disagree with your assertion that rate limiting is a bad idea.
>
> Solr allows you to specify node level request quotas which also follow the
> principle of not limiting internal requests. I find that to be pretty useful
> in two forms: 1. Use it in conjunction with a global request limit which is
> typically 0.75 of my total load capacity given my average query resource
> consumption. 2. Allow per node request limits to ensure fairness and
> dedicated capacity for different types of requests. 3. Allow circuit breakers
> to handle cases where a couple of rogue queries can take down nodes.
>
> We digress -- as I said, it should be fairly simple to have a circuit breaker
> which rejects only external requests, but should be clearly documented with
> its downsides.
>
> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wun...@wunderwood.org> wrote:
> We’ve looked at and rejected rate limiters as high-maintenance and not
> sufficient protection.
>
> We would have run nginx on each node, sent external traffic to nginx on a
> different port and let internal traffic stay on the default Solr port. This
> has other advantages (monitoring), but the rate limiting part is way too
> fiddly.
>
> Rates depend on how much CPU is used per query and on the size of the cluster
> (if they are not on each node). Some examples from our largest cluster which
> would need a change in rate limits. Some of these could be set by doing
> offline load benchmarks, some not.
>
> * Experiment cell that uses 2.5X more CPU for each query (running now in prod)
> * Increasing traffic allocated to that cell (did this last week)
> * Increase in index size (number of docs and CPU requirements increase about
>   5% every month)
> * Website slowdown that shifts most traffic to mobile, where queries use 2X
>   as much CPU
> * Horizontal scaling from 24 to 48 nodes
> * Vertical scaling from c5.8xlarge to c5.18xlarge
>
> And so on. Rate limiting would require almost weekly load benchmarks and it
> still wouldn’t catch the outage-causing problems.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <a...@apache.org> wrote:
>>
>> The way I look at it is that for cluster level stability, rate limiters
>> should be used which allow rate limiting of only external requests. They are
>> "circuit breakers" in the sense of defending against cluster level
>> instability, which is what you describe.
>>
>> Circuit breakers, in Solr world, are targeted to be the last resort defense
>> of a node.
>>
>> As I said earlier, it is possible to write a circuit breaker which rejects
>> only external requests, but I personally do not see the benefit in presence
>> of rate limiters.
>>
>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wun...@wunderwood.org> wrote:
>> Ideally, it would only affect a few queries. In reality, with a sharded
>> system, the impact will be large.
>>
>> I disagree that the goal is to protect a node.
>> The goal is to make the entire cluster avoid congestion failure when
>> overloaded, while providing good service for the load that it can handle.
>>
>> I have had Solr clusters take down entire websites when overloaded, both at
>> Netflix and Chegg, and I’ve built defenses for this at both places. I’m a
>> huge fan of circuit breakers.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <a...@apache.org> wrote:
>>>
>>> This has an issue of still leading to node outages if the fanout for a
>>> query is high.
>>>
>>> Circuit breakers follow a simple rule -- defend the node at the cost of
>>> degraded responses.
>>>
>>> Ideally, only a few requests will be completely rejected -- some will see
>>> partial results. Due to this non discriminating nature of circuit breakers,
>>> the typical blip on service quality due to high resource usage is short
>>> lived.
>>>
>>> However, it is possible to write a circuit breaker which rejects only
>>> external requests in master branch (we have the ability to identify
>>> requests as internal or external there).
>>>
>>> Regards,
>>>
>>> Atri
>>>
>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wun...@wunderwood.org> wrote:
>>> This got zero responses on the solr-user list, so I’ll raise the issue here.
>>>
>>> Should circuit breakers only kill external search requests and not
>>> cluster-internal requests to shards?
>>>
>>> Circuit breakers can kill any request, whether it is a client request from
>>> outside the cluster or an internal distributed request to a shard. Killing
>>> a portion of a distributed request will affect the main request. Not sure
>>> whether a 503 from a shard will kill the whole request or cause partial
>>> results, but it isn’t good.
>>>
>>> We run with 8 shards.
>>> If a circuit breaker is killing 10% of requests on each host, that will
>>> hit 57% of all external requests (0.9^8 = 0.43 is the fraction that gets
>>> through all eight shards untouched). That seems like “overkill” to me. If
>>> it only kills external requests, then 10% means 10%.
>>>
>>> Killing only external requests requires that external requests go roughly
>>> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>
>
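
The 8-shard arithmetic in the original message can be checked with a few lines of Python. This is only a sketch under stated assumptions: each shard request is rejected independently with the same probability, and any rejected shard request counts as hitting the external request (the thread itself is unsure whether a 503 from a shard kills the whole request or only yields partial results). The function name `external_hit_rate` is made up for illustration and is not part of Solr.

```python
def external_hit_rate(shard_kill_rate: float, shards: int) -> float:
    """Fraction of external requests affected when every request fans out
    to `shards` shards and each shard request is rejected independently
    with probability `shard_kill_rate` (an assumption, see above)."""
    # Probability that all shard requests get through untouched.
    all_survive = (1.0 - shard_kill_rate) ** shards
    return 1.0 - all_survive


if __name__ == "__main__":
    # 10% killed per host, 8 shards: 1 - 0.9^8
    print(f"{external_hit_rate(0.10, 8):.2f}")  # 0.57
    # No fan-out (external requests only): 10% means 10%
    print(f"{external_hit_rate(0.10, 1):.2f}")  # 0.10
```

Under the same assumptions the effect grows with fan-out, which is why rejecting only external requests keeps the kill rate at the configured 10%.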