Hi Dmitry,

sorry for the delay, I really didn't have time until now to analyse
the config you sent me.

A few points below:

On Wed, Oct 07, 2015 at 04:18:20PM +0300, Dmitry Sivachenko wrote:
> Oct  7 08:33:03 srv1 haproxy[77565]: unix:1 [07/Oct/2015:08:33:02.428] 
> MT-front MT_RU_EN-back/<NOSRV> 0/1000/-1/-1/1000 503 212 - - sQ-- 
> 125/124/108/0/0 0/28 "POST /some/url HTTP/1.1"
> (many similar at one moment)
> 
> The common part in these errors is "1000" in Tw and Tt, and the "sQ--"
> termination state.
> 
> Here is the relevant part of my config (I can post more if needed):
> 
> defaults
>     balance roundrobin
>     maxconn 10000
>     timeout queue 1s
>     fullconn 3000
>     default-server inter 5s downinter 1s fastinter 500ms fall 3 rise 1 
> slowstart 60s maxqueue 1 minconn 5 maxconn 150
> 
> backend MT_RU_EN-back
>     mode http
>     timeout server 30s
>     server mt1-34 mt1-34:19016 track MT-back/mt1-34 weight 38
>     server mt1-35 mt1-35:19016 track MT-back/mt1-35 weight 38
>     <total 18 of similar servers>
> 
> So this error log indicates that the request was sitting in the queue for the
> whole 'timeout queue' (1s) and its turn never came.
> 
> In the stats web interface for MT_RU_EN-back backend I see the following 
> numbers:
> 
> Sessions: limit=3000, max=126 (for the whole backend)
> Limit=150, max=5 or 6 (for each server)
> 
> If I understand the meaning of minconn/maxconn correctly, each server should
> accept up to min(150, 3000/18) connections.
> 
> So according to the stats, the load was far from the limits.

No, look, the log says there were 108 concurrent connections on the
backend. This is important because you're using minconn, which means
you're using dynamic queuing. The effective per-server limit when
handling this request was around maxconn * currconn / fullconn, which
is 150 * 108 / 3000 = 5.4, so the limit was 5 connections. Thus the
limit for this server was indeed reached.
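
To make it concrete, with your settings (minconn 5, maxconn 150,
fullconn 3000), the effective per-server limit roughly scales with the
backend's concurrent connections like this (rounding details aside):

    currconn =  108  ->  max(5, 150 *  108 / 3000) =  5   (your case)
    currconn =  300  ->  max(5, 150 *  300 / 3000) = 15
    currconn = 1000  ->  max(5, 150 * 1000 / 3000) = 50
    currconn = 3000  ->  150                               (maxconn)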

Playing with minconn and fullconn is hard and is strongly advised
against unless you know exactly how to tune them. You must always
ensure that a normal load will be handled without queuing (or with a
very small queue), and that maxconn will quickly be reached to handle
high traffic. I tend to consider that an efficient fullconn is around
10% of the maximum load the farm may have to deal with (which is the
default value IIRC). Regarding minconn, it's important not to set it
too low. A good rule of thumb is to estimate what would happen at 10%
of fullconn (i.e. 1% of the max load).

In your case, at 300 concurrent connections, your servers will accept
15 connections each. I have no idea whether this is enough to handle
the load. But let's say you have 4 servers: that's only 60 concurrent
connections to process 300 front connections. While it can be
perfectly fine, you may need to increase the queue timeout so that the
requests can wait long enough for a slot. With 5:1 overbooking and
your 1s queue timeout, you're effectively expecting the servers'
average response time to stay below 200ms. That may be a bit short for
some applications, especially those whose response time is sensitive
to the connection count.
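
Put differently, you can sanity-check the queue timeout with a rough
rule of thumb (just an approximation, assuming the queue is drained in
order as server slots free up):

    needed queue timeout ~= overbooking ratio * average response time

    5:1 overbooking, 200ms average response time  ->  ~1s  (your setting)
    5:1 overbooking, 500ms average response time  ->  ~2.5s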

Thus I'd suggest that you either lower fullconn or increase minconn,
and in any case that you also increase the queue timeout so that it
covers the worst overbooking situation at the servers' average
response time.
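
Just to illustrate the direction (the numbers below are purely
illustrative and have to be derived from your real load, not taken as
a recommendation), that could look like this in your defaults section:

    defaults
        timeout queue 5s        # cover the worst expected wait in the queue
        fullconn 1500           # reach per-server maxconn earlier
        default-server minconn 20 maxconn 150   # higher floor under low load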

During the tuning phase, I'd suggest *significantly* increasing the
queue timeout so that you can observe the connection counts and even
the average response time per connection count; that will help you
refine the tuning.
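
For example, something like the following while you observe (the value
is arbitrary, it just has to be large enough that requests almost never
expire in the queue during the observation period):

    timeout queue 30s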

Hoping this helps,
Willy

