Hi Dmitri,

First, let me summarize your issue; tell me if I'm wrong. You have haproxy balancing traffic to squids in reverse-proxy mode, hashing on the URL. The problem is that when a cache comes back up after a crash, it is in trouble because it receives too many MISS requests.
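Concretely, the kind of two-tier arrangement I have in mind could be sketched in squid.conf on an edge cache roughly like this (hostnames, ports and option choices here are illustrative, not taken from your setup):

```
# Hypothetical squid.conf fragment for an EDGE cache:
# query a parent "backend" squid via ICP first, and fall back to the
# origin server if the parent is unavailable.  The parent must run
# with a matching icp_port (3130 here) on its side.
cache_peer backend-cache.internal parent 3128 3130
cache_peer origin.example.com     parent 80   0    no-query originserver default
prefer_direct off    # try peers before going direct to the origin
```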
Are your HTTP backend servers slow? Have you tried some "tiered" caching? I mean, having a "backend" squid that the "edge" ones would query before going to the HTTP backend. The advantage: the backend squid will be hot for all your objects. If an edge squid goes down, haproxy will balance its traffic to the other squids, but the backend one will keep on learning objects. When the failed edge squid comes back, haproxy will balance traffic to it again, whatever the request rate; that squid will learn objects from the backend one.

I would do this configuration using ICP: all your edge squids would first ask your backend one for the object. If the backend has the object, the edge will learn it from there; otherwise it has to go to the origin. The most accessed objects would remain in the backend squid's memory. If the backend squid does not work, you can configure your edge squids to get content from the origin server directly. That way, you can limit the "extra" load on an edge squid during a cold start.

My 2 cents. If you have a lot of traffic from your edges to your backend, you can load-balance that traffic too :)

Cheers

On Tue, Nov 16, 2010 at 1:43 AM, Dmitri Smirnov <[email protected]> wrote:
> Willy,
>
> thank you for taking time to respond. This is always thought provoking.
>
> On 11/13/2010 12:18 AM, Willy Tarreau wrote:
>>
>> Why reject instead of redispatching? Do you fear that non-cached requests
>> will have a domino effect on all your squids?
>
> Yes, this indeed happens. Also, the objective is not to exceed the number of
> connections from the squids to the backend database. With a cold cache, a
> redispatch will cause a cache entry to be brought into the wrong shard,
> where it is unlikely to be reused. Thus it would use up a valuable
> connection just to satisfy one request.
>
> However, even this is an optimistic scenario.
> These cold cache situations happen due to external factors: an AWS issue,
> our home-grown DNS getting messed up (AWS does not provide DNS), etc., which
> causes not all of the squids to be reported to the proxies and messes up the
> distribution. This is because haproxy is restarted after the config file is
> regenerated.
>
> I have been thinking about preserving some of the distribution by using
> server IDs when the set of squids partially changes, but that's another
> story; let's not digress.
>
> Thus even with redispatch enabled, the other squids are unlikely to have
> free connection slots, because when one goes cold, most of them do.
>
> Needless to say, most of the other components in the system are also in
> distress when something happens on a large scale. So I choose the stability
> of the system as the priority, even though some of the clients will be
> refused service, which happens to be the least of the evils.
>
>> I don't see any "http-server-close" there, which makes me think that your
>> squids are selected upon the first request of a connection and will still
>> have to process all subsequent requests, even if their URLs don't hash to
>> them.
>
> Good point. This is not the case, however: forceclose takes care of it, and
> I can see that most of the time the number of concurrently open connections
> to any particular squid changes very quickly in a range of 0-3, even though
> each of them handles a chunk of requests per second.
>
>> Your client timeout at 100ms seems way too low to me. I'm sure you'll
>> regularly get truncated responses due to packet losses. In my opinion,
>> both your client and server timeouts must cover at least a TCP retransmit
>> (3s), so that makes sense to set them to at least 4-5 seconds. Also, do
>> not forget that sometimes your cache will have to fetch an object before
>> responding, so it could make sense to have a server timeout which is at
>> least as big as the squid timeout.
>
> Agree.
> Right now the server timeout in prod is 5s, according to what is recommended
> in the docs. In fact, I will probably reverse my timeout changes to be in
> line with your recommendations.
>
> Having slept on the problem, I came up with a fairly simple idea which is
> not perfect but, I think, gives most of the bang for such a simple change.
>
> It revolves around adding a maxconn restriction for every individual squid
> in the backend. The number can easily be calculated and then tuned after a
> load test.
>
> Let's assume I have 1 haproxy in front of a single squid. For simplicity,
> HIT latency is 5ms and MISS latency is 200ms. Incoming traffic is 1,000 rps
> at peak. From the squid to the backend, let's allow 50 connections max,
> i.e. 250 rps max.
>
> So through a single connection a hot cache will be able to process 200 rps,
> while a cold cache will do only 5 rps. This means that to support hot
> traffic we need at least 5 connections; at the same time this will throttle
> MISS requests to a max of 25 rps.
>
> Because we have 250 rps max at the backend, we can raise maxconn to 50 for
> the squid. This creates a capacity range of 250-10,000 rps.
>
> As a cache warms up, the traffic becomes mixed and drifts towards the hot
> model, so the same number of connections will process more and more requests
> until it reaches the 99.6% hit rate in our case.
>
> I chose to leave the queue size unlimited but put a fast expiration time on
> the queue entries so they are rejected with a 503, unless you have other
> recommendations.
>
> I also chose to impose an individual maxconn rather than a backend maxconn.
> This is to prevent MISS requests from using up the whole connection limit,
> and to allow HITs to be served quickly from hot shards. I am still pondering
> this point, though.
>
> The situation would be more complicated if the maxconn were too big for MISS
> and too small for HIT, but this is not the case.
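In haproxy terms, the per-server maxconn plus fast queue expiry Dmitri describes might look like the fragment below (server addresses, names and the exact queue timeout are illustrative, not from the thread):

```
# Hypothetical haproxy backend fragment for the scheme above:
# URL-hash balancing, a per-server maxconn of 50, and a short queue
# timeout so queued requests are rejected with a 503 instead of
# piling up while a cache is cold.
backend squids
    balance uri
    hash-type consistent
    option forceclose
    timeout queue 500ms
    server squid1 10.0.0.1:3128 maxconn 50 check
    server squid2 10.0.0.2:3128 maxconn 50 check
```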
>
> The biggest problem remaining: clients stop seeing rejections when at least
> 5 connections are available for HIT traffic. This means that MISS traffic
> should be at 225 rps at most, i.e. the caches must be > 77% hot.
>
> I will test experimentally whether this takes too long.
>
> thanks,
> --
> Dmitri Smirnov
>
>
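The arithmetic in Dmitri's proposal can be sanity-checked with a short script; the constants come from the mail, and the helper name is mine:

```python
# Sanity check of the maxconn sizing numbers quoted above.

HIT_LATENCY_S = 0.005     # 5 ms per cache HIT
MISS_LATENCY_S = 0.200    # 200 ms per cache MISS
PEAK_RPS = 1000           # peak incoming traffic
MAXCONN = 50              # proposed per-squid maxconn

def rps_per_conn(latency_s):
    """Requests per second one connection can carry serially."""
    return 1.0 / latency_s

hot_rps_per_conn = rps_per_conn(HIT_LATENCY_S)    # 200 rps per connection
cold_rps_per_conn = rps_per_conn(MISS_LATENCY_S)  # 5 rps per connection

# At least 5 connections are needed to carry peak traffic when hot:
min_hot_conns = PEAK_RPS / hot_rps_per_conn

# Capacity range with maxconn = 50, from all-MISS to all-HIT:
cold_capacity = MAXCONN * cold_rps_per_conn       # 250 rps
hot_capacity = MAXCONN * hot_rps_per_conn         # 10,000 rps

# If 5 connections are kept free for HITs, the other 45 can carry at
# most 225 rps of MISS traffic, i.e. the hit rate must exceed ~77%:
max_miss_rps = (MAXCONN - min_hot_conns) * cold_rps_per_conn
required_hit_rate = 1.0 - max_miss_rps / PEAK_RPS
```

This reproduces the 250-10,000 rps range and the >77% hot threshold from the mail (225 MISS rps out of 1,000 rps is a 22.5% miss ratio, hence a 77.5% hit rate).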

