Walter, it sounds like you were doing rate limiting, just in a different
way that is more dynamic than a simple (yet fiddly) constant?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> Rate limiting is a good idea. It requires a lot of ongoing engineering to
> adjust the rates to the current cluster behavior. It doesn’t help with some
> kinds of overload. The ROI just doesn’t work out. It is too much work for
> not enough benefit.
>
> Rate limiting works if the collection size doesn’t change and the queries
> don’t change.
>
> At Netflix, we limited traffic based on number of connections to each
> server. This is basically the length of the queue of requests for that
> server. This is similar to limiting by load average, which is also the work
> waiting to be done. It has the same weaknesses as the load average circuit
> breaker, but it did not need to be changed when average CPU usage per query
> increased. It was “set and forget”. Rate limiters require constant
> adjustment.
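
As a rough illustration of limiting by connection count rather than by rate, here is a minimal sketch. It is not the actual Netflix implementation; the class name `InFlightLimiter` and the limit value are hypothetical. The point is that the cap is on requests in flight, so it does not need retuning when per-query CPU cost changes.

```python
import threading


class InFlightLimiter:
    """Caps the number of requests a server handles concurrently (hypothetical helper)."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit a request if the queue is below the cap; otherwise reject it."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when a request finishes, whether it succeeded or failed."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


limiter = InFlightLimiter(max_in_flight=2)  # assumed cap, for illustration only
admitted = [limiter.try_acquire() for _ in range(3)]
print(admitted)
```

Slow queries hold their slot longer, so the limiter sheds load sooner when queries get more expensive, without anyone changing the number.
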
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Feb 14, 2021, at 11:44 AM, Atri Sharma <a...@apache.org> wrote:
>
> This is a debate better suited for a different forum -- but I would
> disagree with your assertion that rate limiting is a bad idea.
>
> Solr allows you to specify node level request quotas which also follow the
> principle of not limiting internal requests. I find that to be pretty
> useful in three forms: 1. Use it in conjunction with a global request limit
> which is typically 0.75 of my total load capacity given my average query
> resource consumption. 2. Allow per node request limits to ensure fairness
> and dedicated capacity for different types of requests. 3. Allow circuit
> breakers to handle cases where a couple of rogue queries can take down
> nodes.
>
> We digress -- as I said, it should be fairly simple to have a circuit
> breaker which rejects only external requests, but should be clearly
> documented with its downsides.
>
> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wun...@wunderwood.org>
> wrote:
>
>> We’ve looked at and rejected rate limiters as high-maintenance and not
>> sufficient protection.
>>
>> We would have run nginx on each node, sent external traffic to nginx on a
>> different port and let internal traffic stay on the default Solr port. This
>> has other advantages (monitoring), but the rate limiting part is way too
>> fiddly.
>>
>> Rates depend on how much CPU is used per query and on the size of the
>> cluster (if the limiters are not on each node). Here are some examples from
>> our largest cluster, each of which would need a change in rate limits. Some
>> of these could be set by doing offline load benchmarks, some not.
>>
>> * Experiment cell that uses 2.5X more CPU for each query (running now in
>> prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase
>> about 5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use
>> 2X as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>>
>> And so on. Rate limiting would require almost weekly load benchmarks and
>> it still wouldn’t catch the outage-causing problems.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <a...@apache.org> wrote:
>>
>> The way I look at it is that for cluster level stability, rate limiters
>> should be used which allow rate limiting of only external requests. They
>> are "circuit breakers" in the sense of defending against cluster level
>> instability, which is what you describe.
>>
>> Circuit breakers, in the Solr world, are targeted to be the last-resort
>> defense of a node.
>>
>> As I said earlier, it is possible to write a circuit breaker which
>> rejects only external requests, but I personally do not see the benefit in
>> the presence of rate limiters.
>>
>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wun...@wunderwood.org>
>> wrote:
>>
>>> Ideally, it would only affect a few queries. In reality, with a sharded
>>> system, the impact will be large.
>>>
>>> I disagree that the goal is to protect a node. The goal is to make the
>>> entire cluster avoid congestion failure when overloaded, while providing
>>> good service for the load that it can handle.
>>>
>>> I have had Solr clusters take down entire websites when overloaded, both
>>> at Netflix and Chegg, and I’ve built defenses for this at both places. I’m
>>> a huge fan of circuit breakers.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <a...@apache.org> wrote:
>>>
>>> This has an issue of still leading to node outages if the fanout for a
>>> query is high.
>>>
>>> Circuit breakers follow a simple rule -- defend the node at the cost of
>>> degraded responses.
>>>
>>> Ideally, only a few requests will be completely rejected -- some will see
>>> partial results. Due to this non-discriminating nature of circuit breakers,
>>> the typical blip on service quality due to high resource usage is short
>>> lived.
>>>
>>> However, it is possible to write a circuit breaker which rejects only
>>> external requests in master branch (we have the ability to identify
>>> requests as internal or external there).
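
A minimal sketch of the idea described above, assuming the request has already been classified as internal or external. The function name, the `is_internal` flag, and the load-average threshold are all hypothetical; this is not Solr's circuit breaker API.

```python
def should_reject(is_internal: bool, load_average: float, threshold: float) -> bool:
    """Trip only for external requests when the node is over its threshold.

    `is_internal` is assumed to be set by whatever mechanism identifies
    cluster-internal shard requests; that mechanism is not shown here.
    """
    if is_internal:
        # Never reject shard sub-requests, so one tripped node does not
        # degrade external requests that entered through other nodes.
        return False
    return load_average > threshold


print(should_reject(is_internal=False, load_average=9.0, threshold=8.0))
print(should_reject(is_internal=True, load_average=9.0, threshold=8.0))
```
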
>>>
>>> Regards,
>>>
>>> Atri
>>>
>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wun...@wunderwood.org>
>>> wrote:
>>>
>>>> This got zero responses on the solr-user list, so I’ll raise the issue
>>>> here.
>>>>
>>>> Should circuit breakers only kill external search requests and not
>>>> cluster-internal requests to shards?
>>>>
>>>> Circuit breakers can kill any request, whether it is a client request
>>>> from outside the cluster or an internal distributed request to a shard.
>>>> Killing a portion of a distributed request will affect the main request. Not
>>>> sure whether a 503 from a shard will kill the whole request or cause
>>>> partial results, but it isn’t good.
>>>>
>>>> We run with 8 shards. If a circuit breaker is killing 10% of requests
>>>> on each host, that will hit 57% of all external requests (0.9^8 = 0.43).
>>>> That seems like “overkill” to me. If it only kills external requests, then
>>>> 10% means 10%.
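
The arithmetic above can be checked with a short sketch. It assumes each external request fans out to all 8 shards, that shard-level rejections are independent, and that any rejected shard request affects the external request (whether by failing it or by returning partial results). The function name is hypothetical.

```python
def affected_fraction(per_shard_reject_rate: float, shards: int) -> float:
    """Fraction of external requests with at least one rejected shard request."""
    # A request is unaffected only if every one of its shard requests survives.
    unaffected = (1.0 - per_shard_reject_rate) ** shards
    return 1.0 - unaffected


for shards in (1, 8):
    print(shards, f"{affected_fraction(0.10, shards):.2%}")
```

With one shard the impact matches the per-host rate of 10%; with 8 shards it climbs to roughly 57%, which is the amplification described above.
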
>>>>
>>>> Killing only external requests requires that external requests go
>>>> roughly equally to all hosts in the cluster, or at least all NRT or PULL
>>>> replicas.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>>
>>>
>>>
>>
>
