We’ve looked at rate limiters and rejected them as high-maintenance and 
insufficient protection.

We would have run nginx on each node, sent external traffic to nginx on a 
different port and let internal traffic stay on the default Solr port. This has 
other advantages (monitoring), but the rate limiting part is way too fiddly.

Rates depend on how much CPU is used per query and on the size of the cluster 
(if the rate limiters are not on each node). Here are some examples from our 
largest cluster, each of which would need a change in rate limits. Some of the 
new limits could be set by doing offline load benchmarks, some not.

* Experiment cell that uses 2.5X more CPU for each query (running now in prod)
* Increasing traffic allocated to that cell (did this last week)
* Increase in index size (number of docs and CPU requirements increase about 5% 
every month)
* Website slowdown that shifts most traffic to mobile, where queries use 2X as 
much CPU
* Horizontal scaling from 24 to 48 nodes
* Vertical scaling from c5.8xlarge to c5.18xlarge

And so on. Rate limiting would require almost weekly load benchmarks and it 
still wouldn’t catch the outage-causing problems.
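
To illustrate why the limit keeps moving, here is a rough sketch in Python. It 
assumes a deliberately simplified model in which sustainable throughput scales 
linearly with total cores and inversely with CPU seconds per query. The function 
names, the 70% target utilization, the core count, and the 0.5 CPU-seconds 
baseline are all made-up values for illustration, not measurements from our 
cluster, and a real limit would still need a load benchmark.

```python
def cluster_rate_limit(nodes, cores_per_node, cpu_seconds_per_query,
                       target_utilization=0.7):
    """Queries per second the cluster can sustain at the target CPU utilization.

    Simplified linear model (an assumption): total usable CPU seconds per
    second of wall clock, divided by the CPU cost of one query.
    """
    usable_cpu = nodes * cores_per_node * target_utilization
    return usable_cpu / cpu_seconds_per_query


def blended_cpu(base_cpu, multiplier, traffic_fraction):
    """Average CPU per query when a fraction of traffic costs `multiplier` times more."""
    return base_cpu * ((1.0 - traffic_fraction) + multiplier * traffic_fraction)


# Hypothetical baseline: 24 nodes, 32 cores each, 0.5 CPU seconds per query.
baseline = cluster_rate_limit(24, 32, 0.5)

# Experiment cell at 2.5X CPU taking 20% of traffic (the 20% is assumed).
with_cell = cluster_rate_limit(24, 32, blended_cpu(0.5, 2.5, 0.20))

# Horizontal scaling from 24 to 48 nodes.
scaled_out = cluster_rate_limit(48, 32, 0.5)

print(f"limit with experiment cell: {with_cell / baseline:.2f}x baseline")
print(f"limit after scaling out:    {scaled_out / baseline:.2f}x baseline")
```

Even in this toy model, every item in the list above changes at least one 
input, so the configured limit goes stale each time.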

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 10:25 AM, Atri Sharma <a...@apache.org> wrote:
> 
> The way I look at it is that for cluster level stability, rate limiters 
> should be used which allow rate limiting of only external requests. They are 
> "circuit breakers" in the sense of defending against cluster level 
> instability, which is what you describe.
> 
> Circuit breakers, in Solr world, are targeted to be the last resort defense 
> of a node.
> 
> As I said earlier, it is possible to write a circuit breaker which rejects 
> only external requests, but I personally do not see the benefit in the 
> presence of rate limiters.
> 
> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wun...@wunderwood.org 
> <mailto:wun...@wunderwood.org>> wrote:
> Ideally, it would only affect a few queries. In reality, with a sharded 
> system, the impact will be large.
> 
> I disagree that the goal is to protect a node. The goal is to make the entire 
> cluster avoid congestion failure when overloaded, while providing good 
> service for the load that it can handle.
> 
> I have had Solr clusters take down entire websites when overloaded, both at 
> Netflix and Chegg, and I’ve built defenses for this at both places. I’m a 
> huge fan of circuit breakers.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <a...@apache.org 
>> <mailto:a...@apache.org>> wrote:
>> 
>> This has an issue of still leading to node outages if the fanout for a query 
>> is high.
>> 
>> Circuit breakers follow a simple rule -- defend the node at the cost of 
>> degraded responses.
>> 
>> Ideally, only a few requests will be completely rejected -- some will see 
>> partial results. Due to this non-discriminating nature of circuit breakers, 
>> the typical blip on service quality due to high resource usage is short 
>> lived.
>> 
>> However, it is possible to write a circuit breaker which rejects only 
>> external requests in master branch (we have the ability to identify requests 
>> as internal or external there).
>> 
>> Regards,
>> 
>> Atri
>> 
>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wun...@wunderwood.org 
>> <mailto:wun...@wunderwood.org>> wrote:
>> This got zero responses on the solr-user list, so I’ll raise the issue here.
>> 
>> Should circuit breakers only kill external search requests and not 
>> cluster-internal requests to shards?
>> 
>> Circuit breakers can kill any request, whether it is a client request from 
>> outside the cluster or an internal distributed request to a shard. Killing a 
>> portion of a distributed request will affect the main request. Not sure 
>> whether a 503 from a shard will kill the whole request or cause partial 
>> results, but it isn’t good.
>> 
>> We run with 8 shards. If a circuit breaker is killing 10% of requests on 
>> each host, that will hit 57% of all external requests (0.9^8 = 0.43). That 
>> seems like “overkill” to me. If it only kills external requests, then 10% 
>> means 10%.
>> 
>> Killing only external requests requires that external requests go roughly 
>> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
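
The 57% figure in the original message (0.9^8 = 0.43) can be checked with a 
short Python sketch. It assumes that every external request fans out to exactly 
one replica of each shard and that each shard request is killed independently 
with the same probability. The function name is made up for illustration.

```python
def fraction_of_requests_hit(per_host_kill_rate, shards):
    """Probability that at least one shard sub-request of a query is killed.

    Assumes independent kills and one sub-request per shard.
    """
    survives_all_shards = (1.0 - per_host_kill_rate) ** shards
    return 1.0 - survives_all_shards


# 8 shards, circuit breaker killing 10% of requests on each host.
print(round(fraction_of_requests_hit(0.10, 8), 2))  # 0.57

# If only external requests are killed, there is no fanout effect: 10% means 10%.
print(round(fraction_of_requests_hit(0.10, 1), 2))  # 0.1
```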
