A few ideas that you might or might not want to consider:

* As another poster just mentioned, you might consider ICP, but they suggested having all your squids talk to one master squid. I would instead do something like this:

Currently, my understanding of your layout:

haproxy -> hashed_url -> squid X of Y -> db shard for X content

If you want a more robust architecture, you might try one of a couple of different things:

haproxy -> hashed_url -> squid X/Y
                               |
              single hidden squid peer for X/Y -> small haproxy -> db shard

or if you can afford the AWS instances:

haproxy -> hashed_url -> squid Xa/Y
                              |
                         squid Xb/Y -> small haproxy -> db shard for X content
                              |
                         squid Xc/Y -> small haproxy -> db shard for X content
                              |
                         squid Xd/Y -> small haproxy -> db shard for X content

Only one of Xa-d has the IP address used in the haproxy config; the others are hidden peers that never talk to haproxy or the end users directly. They would share data using ICP or maybe cache digests (cache digests are supposed to be faster, depending on workload mix). This should be more resistant to outages.
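A minimal squid sketch of such a peer group, on the standard ICP port 3130 — the hostnames are placeholders and this is illustrative, not a drop-in config:

```
# On squid Xa (the only one haproxy knows about); Xb-Xd would
# carry the symmetric cache_peer lines. Hostnames are placeholders.
icp_port 3130
cache_peer squid-xb.internal sibling 3128 3130 proxy-only
cache_peer squid-xc.internal sibling 3128 3130 proxy-only
cache_peer squid-xd.internal sibling 3128 3130 proxy-only
# note: cache digest exchange requires squid built with
# --enable-cache-digests; otherwise the siblings fall back to ICP
```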

If it were me, I would go further and create a set of peer squids for each hash value, have each set load balanced by a small haproxy, and manage their connections to their db shard via another haproxy instance:

haproxy -> hashed_url -> ...

                             | squid Xa/Y |
                             |      |     |
                             | squid Xb/Y |
...small haproxy for X/Y ->  |      |     | -> haproxy for db shard X
                             | squid Xc/Y |
                             |      |     |
                             | squid Xd/Y |

In this setup, the lines in your main haproxy backend would point to a small haproxy for hash value X of Y (I don't think there is a way to do this within one single haproxy instance). That small haproxy would have N squids, all answering user queries and talking to each other as a flat group of peers using ICP or cache digests. That way you would have N squids that are hot for hash value X. If one dies, you can use a longer slow start period; with ICP/cache digests and N-1 hot caches, you would be able to heat up the new squid instance very quickly without overloading your db or significantly slowing your ability to answer queries for hash value X.
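The main haproxy backend for that layout might look roughly like this (addresses and backend names are placeholders; `balance uri` does the URL hashing):

```
# Main haproxy: each "server" here is really a small haproxy
# fronting one squid peer group (addresses are placeholders)
backend url_buckets
    balance uri
    hash-type consistent
    server bucket_0 10.0.0.10:8080 check
    server bucket_1 10.0.0.11:8080 check
```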

I would think you want to evolve your config towards allowing some queueing of requests, so that it can absorb some amount of request spikes without a detrimental effect on the backend db, etc. Having the setup above, with a haproxy in front of and behind your squid group, would let you use long slow start times, allowing a new squid to warm up slowly without much adverse effect on the overall throughput/performance of the system.
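For the warm-up behavior, haproxy's `slowstart` server option ramps a recovering server's traffic share back up over the given period; a sketch with illustrative values:

```
# Small haproxy for one hash bucket: ramp a restarted or
# replacement squid back up gently instead of hitting it cold
backend squids_bucket_x
    balance roundrobin
    server squid_xa 10.0.1.10:3128 check slowstart 300s
    server squid_xb 10.0.1.11:3128 check slowstart 300s
```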

100ms seems like a very short time to wait for clients, if those are real end users and not some internal system that you know is very fast. For instance, if you have people using apps on iPhone and iPad, I regularly see delays from wap gateways bigger than 100ms, so you might want to reconsider some of those timeouts.

On 11/15/10 4:43 PM, Dmitri Smirnov wrote:
Willy,

thank you for taking time to respond. This is always thought provoking.

On 11/13/2010 12:18 AM, Willy Tarreau wrote:

Why reject instead of redispatching ? Do you fear that non-cached requests
will have a domino effect on all your squids ?

Yes, this indeed happens. Also, the objective is not to exceed the number of
connections from squids to the backend database. In the case of a cold cache, a
redispatch will bring a cache entry into the wrong shard, where it is unlikely
to be reused. Thus it would use up a valuable connection just to satisfy one
request.

However, even this is an optimistic scenario. These cold cache situations
happen due to external factors like an AWS issue or our home-grown DNS getting
messed up (AWS does not provide DNS), which causes some of the squids not to be
reported to the proxies and messes up the distribution. This is because haproxy
is restarted after the config file is regenerated.

I have been thinking about preserving some of the distribution using server
IDs when the set of squids partially changes but that's another story, let's
not digress.

Thus, even with redispatch enabled, the other squid is unlikely to have free
connection slots, because when one goes cold, most of them do.

Needless to say, most of the other components in the system are also in
distress when something happens on a large scale. So I chose stability of the
system as the priority, even though some clients will be refused service, which
happens to be the least of the evils.


I don't see any "http-server-close" there, which makes me think that your
squids are selected upon the first request of a connection and will still
have to process all subsequent requests, even if their URL don't hash to
them.

Good point. This is not the case, however: forceclose takes care of it, and I
can see that most of the time the number of concurrently open connections to
any particular squid changes very quickly in a range of 0-3, even though each
of them handles a chunk of requests per second.


Your client timeout at 100ms seems way too low to me. I'm sure you'll
regularly get truncated responses due to packet losses. In my opinion,
both your client and server timeouts must cover at least a TCP retransmit
(3s), so that makes sense to set them to at least 4-5 seconds. Also, do
not forget that sometimes your cache will have to fetch an object before
responding, so it could make sense to have a server timeout which is at
least as big as the squid timeout.

Agreed. Right now the server timeout in prod is 5s, as recommended in the docs.
In fact, I will probably reverse my timeout changes to be in line with your
recommendations.

Having slept on the problem, I came up with a fairly simple idea which is not
perfect but, I think, gives most of the bang for such a simple change.

It revolves around adding a maxconn restriction for every individual squid in
the backend.

And the number can be easily calculated and then tuned after a loadtest.

Let's assume I have 1 haproxy in front of a single squid.

Furthermore, HIT latency: 5ms, MISS latency 200ms for simplicity.

Incoming traffic 1,000 rps at peak.

From squids to the backend, let's allow 50 connections max, i.e. 250 rps max.

So through a single connection, a hot cache will be able to process 200 rps. A
cold cache will do only 5 rps.

This means that to support hot traffic we need at least 5 connections. At the
same time this throttles MISS requests to a max of 25 rps.

Because we have 250 rps max at the backend, we can raise maxconn to 50 for the
squid. This gives a capacity range of 250-10,000 rps.

As the caches warm up, the traffic mix drifts towards the hot model, so the
same number of connections will process more and more requests until it reaches
a 99.6% hit rate in our case.

I chose to leave the queue size unlimited but put a fast expiration time on the
queue entries so they are rejected with a 503, unless you have other
recommendations.

I also chose to impose an individual maxconn rather than a backend maxconn.
This is to prevent MISS requests from using up the whole connection limit, and
to allow HITs to be served quickly from hot shards.
I am still pondering this point though.
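In haproxy terms, the per-squid maxconn plus an unlimited queue with fast expiry could be sketched as (names and values are illustrative):

```
backend squid_shard_x
    timeout queue 200ms     # queued requests expire quickly with a 503
    # per-server maxconn caps connections to each squid; excess
    # requests wait in the queue until timeout queue fires
    server squid_x1 10.0.2.10:3128 check maxconn 50
```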

The situation would be more complicated if the maxconn were too big for MISS
and too small for HIT, but this is not the case.

The biggest problem remaining: clients stop seeing rejections only when at
least 5 connections are available for HIT traffic. This means that MISS traffic
should be at 225 rps at the most, i.e. the caches must be more than ~77% hot.

I will test experimentally whether this takes too long.

thanks,
