Re: 1.3.17 in TCP mode sees dead servers (but they're not)

Willy Tarreau Wed, 06 May 2009 21:27:23 -0700

On Mon, May 04, 2009 at 11:47:10AM +0200, Nicolas MONNET wrote:
> I'm experiencing a problem since updating to 1.3.17, whereby checks
> periodically see a backend service as down, one at a time, but for all 3
> checks; and it picks right up again on the next check. Not sure what
> info I could get you.


Generally this is caused by overloaded servers which cannot respond at
all because of the amount of work queued in their backlog. Please add
"maxconn 50", for instance, to each "server" line to see if it changes
anything. Also, what type of server are you using? For instance,
mongrel only accepts one request at a time and will not respond to any
health check while it is processing a long request, so with it you
need "maxconn 1".
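
As a rough sketch of the "maxconn" suggestion above (the listen name,
addresses and ports are made up for illustration, so adapt them to your
own config), the server lines would look something like this:

```
# Hypothetical names and addresses, for illustration only.
listen tcp_farm 0.0.0.0:8000
    mode tcp
    balance source
    # Send at most 50 concurrent connections to each server; the
    # excess waits in haproxy's queue instead of the server's backlog.
    server srv1 192.0.2.11:8000 check maxconn 50
    server srv2 192.0.2.12:8000 check maxconn 50
    server srv3 192.0.2.13:8000 check maxconn 50
```

With a server that handles a single request at a time, such as
mongrel, the same lines would carry "maxconn 1" instead.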

> One question: couldn't it be possible to have redispatch work for TCP
> connections? 

It does. However, you have one particular config: you're using "balance
source" with your TCP config. That means that when the connection is
redispatched, the LB algorithm is applied again, and as long as the
server is still seen as up you can only get back to the same one,
because the size of the farm has not changed. There are two workarounds
for this:
 - don't use "balance source" when not needed :-)
 - add enough retries to cover for the time to detect the server down,
   taking into account that each attempt waits at least 1 second.

For the second solution, you can combine "inter" and "fastinter" to
lower the failure detection time. For instance, "inter 5s fastinter
1s fall 2" will take 5 + 2*1 = 7s to see the server as down. So with
at least 8 retries it should be OK. The redispatch will occur once
the server has been taken out of the farm, so the source hash
algorithm will bring you to another server.
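
Put together, the second workaround might look like the fragment
below. This is only a sketch: the listen name, addresses and ports are
invented, and the timings are simply the ones from the example above.

```
# Hypothetical names and addresses, for illustration only.
listen tcp_farm 0.0.0.0:8000
    mode tcp
    balance source
    # Allow the last attempt to go to another server.
    option redispatch
    # Each attempt waits at least 1 second, so 8 retries outlast the
    # roughly 7s needed to mark the server down with the checks below.
    retries 8
    server srv1 192.0.2.11:8000 check inter 5s fastinter 1s fall 2
    server srv2 192.0.2.12:8000 check inter 5s fastinter 1s fall 2
    server srv3 192.0.2.13:8000 check inter 5s fastinter 1s fall 2
```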

Regards,
Willy
