Walter, it sounds like you were doing rate limiting, just in a different way that is more dynamic than a simple (yet fiddly) constant?
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <wun...@wunderwood.org> wrote:

> Rate limiting is a good idea. It requires a lot of ongoing engineering to
> adjust the rates to the current cluster behavior. It doesn’t help with some
> kinds of overload. The ROI just doesn’t work out. It is too much work for
> not enough benefit.
>
> Rate limiting works if the collection size doesn’t change and the queries
> don’t change.
>
> At Netflix, we limited traffic based on number of connections to each
> server. This is basically the length of the queue of requests for that
> server. This is similar to limiting by load average, which is also the work
> waiting to be done. It has the same weaknesses as the load average circuit
> breaker, but it did not need to be changed when average CPU usage per query
> increased. It was “set and forget”. Rate limiters require constant
> adjustment.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Feb 14, 2021, at 11:44 AM, Atri Sharma <a...@apache.org> wrote:
>
> This is a debate better suited for a different forum -- but I would
> disagree with your assertion that rate limiting is a bad idea.
>
> Solr allows you to specify node level request quotas which also follow the
> principle of not limiting internal requests. I find that to be pretty
> useful in two forms:
>
> 1. Use it in conjunction with a global request limit which is typically
>    0.75 of my total load capacity given my average query resource
>    consumption.
> 2. Allow per node request limits to ensure fairness and dedicated capacity
>    for different types of requests.
> 3. Allow circuit breakers to handle cases where a couple of rogue queries
>    can take down nodes.
>
> We digress -- as I said, it should be fairly simple to have a circuit
> breaker which rejects only external requests, but should be clearly
> documented with its downsides.
>
> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wun...@wunderwood.org> wrote:
>
>> We’ve looked at and rejected rate limiters as high-maintenance and not
>> sufficient protection.
>>
>> We would have run nginx on each node, sent external traffic to nginx on a
>> different port and let internal traffic stay on the default Solr port. This
>> has other advantages (monitoring), but the rate limiting part is way too
>> fiddly.
>>
>> Rates depend on how much CPU is used per query and on the size of the
>> cluster (if they are not on each node). Some examples from our largest
>> cluster which would need a change in rate limits. Some of these could be
>> set by doing offline load benchmarks, some not.
>>
>> * Experiment cell that uses 2.5X more CPU for each query (running now in
>>   prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase
>>   about 5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use
>>   2X as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>>
>> And so on. Rate limiting would require almost weekly load benchmarks and
>> it still wouldn’t catch the outage-causing problems.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <a...@apache.org> wrote:
>>
>> The way I look at it is that for cluster level stability, rate limiters
>> should be used which allow rate limiting of only external requests. They
>> are "circuit breakers" in the sense of defending against cluster level
>> instability, which is what you describe.
>>
>> Circuit breakers, in Solr world, are targeted to be the last resort
>> defense of a node.
>>
>> As I said earlier, it is possible to write a circuit breaker which
>> rejects only external requests, but I personally do not see the benefit in
>> the presence of rate limiters.
>>
>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wun...@wunderwood.org> wrote:
>>
>>> Ideally, it would only affect a few queries. In reality, with a sharded
>>> system, the impact will be large.
>>>
>>> I disagree that the goal is to protect a node. The goal is to make the
>>> entire cluster avoid congestion failure when overloaded, while providing
>>> good service for the load that it can handle.
>>>
>>> I have had Solr clusters take down entire websites when overloaded, both
>>> at Netflix and Chegg, and I’ve built defenses for this at both places. I’m
>>> a huge fan of circuit breakers.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <a...@apache.org> wrote:
>>>
>>> This has an issue of still leading to node outages if the fanout for a
>>> query is high.
>>>
>>> Circuit breakers follow a simple rule -- defend the node at the cost of
>>> degraded responses.
>>>
>>> Ideally, only a few requests will be completely rejected -- some will see
>>> partial results. Due to this non-discriminating nature of circuit breakers,
>>> the typical blip in service quality due to high resource usage is
>>> short-lived.
>>>
>>> However, it is possible to write a circuit breaker which rejects only
>>> external requests on the master branch (we have the ability to identify
>>> requests as internal or external there).
>>>
>>> Regards,
>>>
>>> Atri
>>>
>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wun...@wunderwood.org> wrote:
>>>
>>>> This got zero responses on the solr-user list, so I’ll raise the issue
>>>> here.
>>>>
>>>> Should circuit breakers only kill external search requests and not
>>>> cluster-internal requests to shards?
>>>>
>>>> Circuit breakers can kill any request, whether it is a client request
>>>> from outside the cluster or an internal distributed request to a shard.
>>>> Killing a portion of a distributed request will affect the main request.
>>>> Not sure whether a 503 from a shard will kill the whole request or cause
>>>> partial results, but it isn’t good.
>>>>
>>>> We run with 8 shards. If a circuit breaker is killing 10% of requests
>>>> on each host, that will hit 57% of all external requests (0.9^8 = 0.43).
>>>> That seems like “overkill” to me. If it only kills external requests, then
>>>> 10% means 10%.
>>>>
>>>> Killing only external requests requires that external requests go
>>>> roughly equally to all hosts in the cluster, or at least all NRT or PULL
>>>> replicas.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
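
The fan-out arithmetic in the original message (8 shards, 10% of requests killed on each host, 0.9^8 = 0.43) can be checked with a short sketch. It assumes each shard request is rejected independently with the same probability, and that one rejected shard request is enough to affect the external request; the thread itself leaves open whether a 503 from a shard fails the whole request or only yields partial results. The function name is hypothetical and is not part of Solr.

```python
def affected_fraction(per_host_reject: float, shards: int) -> float:
    """Fraction of external requests that hit at least one shard rejection.

    Assumption (not from Solr): rejections on different shards are
    independent and happen with the same probability on every host.
    """
    if not 0.0 <= per_host_reject <= 1.0:
        raise ValueError("per_host_reject must be between 0 and 1")
    if shards < 1:
        raise ValueError("shards must be at least 1")
    # A request is unaffected only if every one of its shard requests survives.
    survive_all = (1.0 - per_host_reject) ** shards
    return 1.0 - survive_all


if __name__ == "__main__":
    # Compare a few shard counts at the 10% per-host rejection rate
    # used in the original message.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} shards: {affected_fraction(0.10, n):.0%} affected")
```

With one shard (or with a breaker that rejects only external requests) the affected fraction equals the per-host rate, which is the "10% means 10%" case; with 8 shards it rises to roughly 57%, matching the figure quoted above.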