Hi Dmitri,

On Fri, Nov 12, 2010 at 11:43:37AM -0800, Dmitri Smirnov wrote:
(...)
> There is a drastic difference in the performance characteristics for the 
> system when caches are cold and when they are hot.
> 
> While we are hot we serve about 6,000-10,000 rps, queues to backends are 
> zero, the number of concurrent connections to backends is near zero.
> This boils down to a range 600 - 1,000 rps per haproxy instance for 10 
> instances.
> 
> With cold caches, or when the distribution is thrown off, the latencies 
> shoot up 2-3 orders of magnitude and the number of concurrent 
> connections to squids goes up to hundreds. This led to a flood of 
> client retries (being fixed now), often maxing out the number of 
> sockets, leading haproxy to believe that squids are not reachable and 
> marking them down (flip/flop). This led to redispatches, making the 
> picture even worse.
> 
> The limiting factor here is latency and the number of concurrent 
> persistent connections that can be established back to the database 
> from the squids, which I believe to be around 500.
> Naturally, we have caches here in the first place for a reason.
> This problem will require some time to address and is being actively 
> worked on.
> 
> While there are some fundamental problems here to work on, I was 
> wondering if I could quickly tweak haproxy's configuration to gracefully 
> support both modes of operation in the short term, since it is currently 
> the only place in the chain where powerful scripting can be done.
> 
> The objectives are:
> 
> 1) allow maximum possible throughput when caches are hot
> 
> 2) When caches are cold sustain a level of throughput that will allow 
> caches to warm up w/o melting the system down.

This is what the slowstart is intended for. It will progressively adjust
the server's weight once the server is seen as operational. This is
compatible with consistent hashing and is very efficient, because your
server will slowly get more and more URLs, so it will have some time to
cache them.
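
For instance, a long slowstart can be declared directly on the server
line; the backend layout, address and ramp-up time below are only
illustrative:

```
backend servers
    balance uri
    hash-type consistent
    # after the server passes its health checks, its effective weight
    # ramps up to 100 over 10 minutes, so it receives URLs progressively
    server squid1 10.0.0.1:8080 check weight 100 slowstart 10m
```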

This just assumes that every time a server is restarted, its cache may
be cold. If this happens too rarely to justify a long slowstart interval,
maybe you'd better adjust the server's weight from haproxy's command line
while it's heating up?
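
Since your stats socket is already declared at "level admin" in the
global section, the weight can be changed at runtime over that socket;
the socat invocation and server name below are illustrative:

```
# run the server at 25% of its configured weight while it heats up
echo "set weight servers/squid1 25%" | socat stdio unix-connect:/apps/haproxy/var/stats

# then restore it once the cache is warm
echo "set weight servers/squid1 100%" | socat stdio unix-connect:/apps/haproxy/var/stats
```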

> 3) detect a slowdown by checking one or more of:
> - avg_queue size
> - queue size
> - number of concurrent connections going up

This is part of a work that was begun 3 years ago and for which only
the dynamic weights were implemented. This is what we called automatic
weight balancing. The principle is to define one or more metrics that
could be used to adjust weights. For instance, response times, concurrent
connections, error rate or any data found in response headers or health
checks about the server's health, etc... This is a complex but very
interesting part which still needs to be worked on.

> 4) Quickly reject requests that come beyond predetermined cold cache 
> capacity. If possible, do it on an individual server level rather than 
> on a backend level ( for cases when only some caches are cold).

You may try the "maxqueue" server parameter, it's supposed to take
excess requests out of their queues and redispatch them to other
servers. However, since you're using a hash, I doubt they would be
redispatched. Still that may be something to try.
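
It is a per-server setting; with something like the following (values
purely illustrative), a request that would push the server's queue
beyond maxqueue is redispatched instead of waiting there:

```
server squid1 10.0.0.1:8080 check maxconn 20 maxqueue 5
```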

> One of the issues here is that if I specify maxconn for the individual 
> server, the connection is not rejected but goes to a queue. If I limit 
> the queue size then when timeout expires it will redispatch to another 
> server.  I want re-dispatches only when a squid is down.

Why reject instead of redispatching ? Do you fear that non-cached requests
will have a domino effect on all your squids ?

> Below is a version of config under construction and somewhat simplified.

I don't see any "http-server-close" there, which makes me think that your
squids are selected upon the first request of a connection and will still
have to process all subsequent requests, even if their URLs don't hash to
them.
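
Adding it to your defaults section makes haproxy process each request of
a connection independently, so every request is hashed and routed to the
right squid:

```
defaults
    mode http
    option http-server-close
```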

> I will work out the exact numbers later. Right now the server maxconn, 
> slowstart timeout and queue threshold are pure speculation.

My guess is that a slowstart should be very long for a cache, about the
time it takes to reach nominal speed (maybe 10 minutes or so).

> I would appreciate any help as I am trying to wrap my brain around a lot 
> of variables here and available tuning knobs.

Your client timeout at 100ms seems way too low to me. I'm sure you'll
regularly get truncated responses due to packet losses. In my opinion,
both your client and server timeouts must cover at least a TCP retransmit
(3s), so it makes sense to set them to at least 4-5 seconds. Also, do
not forget that sometimes your cache will have to fetch an object before
responding, so it could make sense to have a server timeout which is at
least as big as the squid timeout.
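
Under these assumptions the timeouts could look like this; the exact
values depend on your squid configuration and are only a starting point:

```
    timeout client  5s      # covers at least one TCP retransmit
    timeout connect 200ms   # LAN connect, can stay short
    timeout server  30s     # >= squid's own fetch timeout on cache misses
```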

Your overload backend is a good idea. In my opinion, it could support some
queuing, low maxconns on the servers and make use of the leastconn LB algo.
That way it will not consider cache freshness but will try to connect to the
least used cache to get the work done. I really think it can improve the
overall throughput (and it's easy to test with and without it). It could
also cover for long slowstarts when everything is restarted, because it
will take care of the service regardless of the slowstart in the other
backend.
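
A sketch of such an overload backend, with hypothetical addresses and
limits:

```
backend overload
    balance leastconn
    timeout queue 500ms
    errorfile 503 /apps/haproxy/etc/fe_503.http
    # small per-server limits plus a bit of queuing: send work to
    # whichever squid is least loaded, regardless of URL hashing
    server squid1 10.0.0.1:8080 check maxconn 5
    server squid2 10.0.0.2:8080 check maxconn 5
```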

Regards,
Willy

> -- 
> Dmitri Smirnov
> 
> # This is a CE haproxy test config boilerplate
> global
>     daemon
>     stats socket /apps/haproxy/var/stats level admin
>     maxconn 10000
> 
> defaults
>     mode http
>     balance uri
>     hash-type consistent
> # local0 needs to be configured at /etc/syslog.conf
>     log /dev/log local0
>     option httplog
> 
> # Maximum number of concurrent connections on the frontend
> # set to be the half of the total max in the global section above
>     maxconn 5000
> 
> # timeout client is the max time of client inactivity
> # when the client is expected to ack or send data
> # we do not want to tie up resources for a long time
>     timeout client  100ms
> 
> # This is a max time to wait for connection to a server to succeed
>     timeout connect 200ms
> 
> # This is a maximum timeout to wait in a queue at the backend
> # by default it is the same as timeout connect but we set it explicitly
> # Below we do not allow the queue to grow beyond 1 as this indicates
> # that servers are slow and overloaded.
>     timeout queue 200ms
> 
> # Maximum inactivity timeout for the server to ack or send data
> # In other words, in situations of meltdown we are not going to wait
> # for slow data to come back (not what is currently in prod)
> # but this will still hopefully allow squid to refill
> # max time is usually less than a second
>     timeout server 1000ms
> 
> frontend http-in
>    bind *:8080
>    default_backend servers
> 
> # Problem: if one squid is cold this rejects requests for the whole farm
>    acl q_too_long avg_queue(servers) gt 0
>    use_backend overload if q_too_long
> 
> backend overload
> # HAproxy will issue 503 because no servers are available for this backend
> # Here we customize the response
>     errorfile 503 /apps/haproxy/etc/fe_503.http
> 
> backend servers
>     stats enable
>     stats uri     /haproxy?status
>     stats refresh 5s
>     stats show-legends
>     stats show-node
>     option forceclose
>     option forwardfor
> 
> # Redispatch if the destination server is down. This option will also
> # redispatch if a queue timeout expired. However, we do not want
> # to redispatch in that case.
>     option redispatch
>     retries 1
> 
> # Dynamically generated section follows.
> # Example
> server ec2-XXXX   ec2-XXXX.compute-1.amazonaws.com:8080 check inter 1000 rise 5 fall 3 maxconn 20 slowstart 30s
> 
