Over the past few months we've noticed that the killing and resurrection
of backends was not happening on time. We also noticed that some
backends were being killed even though they were alive: because of some
network-related hiccups (which we're still investigating), pound would
kill a healthy backend.
I've spent the last couple of days debugging the pound code. I found
that the problem is not really in pound itself, but in how our
configuration interacts with the pound code.
First, let me quickly describe our configuration for relevant variables:
TimeOut 180 -- very high, but some of our HTTP requests can take a while
to complete
Alive 15
Some of the backends have a TimeOut override of 15 sec, but in general
they should all be at 180 sec. We also have an HAPort for each of our
backends (separate from the regular HTTP port). We have a custom app
which disables the HAPort for a server when we need to take a backend
offline, reset the HTTP service, etc. Otherwise, the HAPort is always
listening for pound's Host-Alive checks.
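For reference, a minimal sketch of the relevant parts of our
configuration (the address and ports below are placeholders, not our
real values):

TimeOut 180
Alive   15

Service
    BackEnd
        Address 192.0.2.10
        Port    80
        HAport  8000    # our custom app closes this to take the backend offline
        # TimeOut 15    # per-backend override, used on a few backends
    End
End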
We've run into a problem where one of our backends died and stayed down
for a while. This caused pound to run its resurrection code
(do_resurect) only every 3 minutes (our default TimeOut value). We
tracked it down to the point in do_resurect() where pound tries to
connect to the dead server and waits the full 3 minutes before timing
out. While it waits, and since only one thread runs do_resurect, the
rest of the servers are not checked every 15 seconds, as intended by the
"Alive" value.
The problem, as I see it, is the lack of a separate variable for the
connect timeout vs. the timeout for reads/gets. Currently, pound uses
the same TimeOut value both when connecting and when waiting on
reads/gets. A "ConnectTimeOut" could be an optional variable that
defaults to the regular TimeOut.
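To illustrate, here is a rough sketch of what we have in mind. The
names (BACKEND_SKETCH, connect_timeout, connect_nb_sketch) are made up
for this example and are not pound's actual structures; the point is
just that connect() would get its own, usually much shorter, timeout,
falling back to TimeOut when ConnectTimeOut is unset:

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-backend fields: to is the existing TimeOut,
   conn_to is the proposed optional ConnectTimeOut (0 = unset). */
typedef struct {
    int to;       /* read/gets timeout (TimeOut), seconds */
    int conn_to;  /* connect timeout (ConnectTimeOut), seconds, 0 = unset */
} BACKEND_SKETCH;

/* Pick the timeout to use for connect(): fall back to TimeOut
   when ConnectTimeOut was not configured. */
int connect_timeout(const BACKEND_SKETCH *be)
{
    return be->conn_to > 0 ? be->conn_to : be->to;
}

/* Sketch of a non-blocking connect honoring a dedicated connect
   timeout (in seconds). Returns 0 on success, -1 on error/timeout. */
int connect_nb_sketch(int sock, const struct sockaddr *addr,
                      socklen_t alen, int conn_to)
{
    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    if (connect(sock, addr, alen) == 0)
        return 0;                        /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;
    struct pollfd pfd = { .fd = sock, .events = POLLOUT };
    if (poll(&pfd, 1, conn_to * 1000) != 1)
        return -1;                       /* timed out (or poll error) */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return -1;                       /* connect failed */
    return 0;
}
```

With ConnectTimeOut set to a few seconds, a dead backend would hold the
do_resurect thread for that long instead of the full 180 seconds, so the
other backends would still get their 15-second Alive checks.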
We also have a related issue with the way pound kills backends when
connect_nb to the regular "Port" of a backend fails during an HTTP
request. As I mentioned above, we've seen network hiccups where one
connect call times out even though the backend is fine and another
connect at the same time goes through. This has caused pound to kill the
backend during an HTTP request when the connect fails (and this happens
3 minutes after the initial call to connect_nb, during which time a
bunch of other requests have completed). I was wondering: in the case
where an HAPort exists, should pound kill a backend at all if the HAPort
says it is alive?
I believe in such setup (where HAPort is defined), when connect_nb
inside thr_http fails, pound should either:
1. Do nothing with the backend (let do_resurect take the backend offline
if it's dead), and get the next backend from the list of available
servers, or
2. Check HAPort to see if the backend is alive, and take appropriate
action, or
3. Retry the connect_nb, and if fails again, take the backend offline, or
4. Track the failure, if reached some threshold value (i.e. 5
consecutive failures), then take the backend offline.
The last one is a bit more complicated, but would make sure the backend
is eventually taken out of the pool if the HAPort is still responding
but the HTTP service is not. On the other hand, if an HAPort exists,
then it's really the responsibility of the application running the
HAPort to do such checks, and to refuse connections on the HAPort if the
HTTP service is dead (so one of the first two options would make more
sense).
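For completeness, option 4 could look something like this. The struct,
the function name, and the threshold of 5 are all hypothetical, just to
show the bookkeeping: count consecutive connect failures per backend,
reset on any success, and only mark the backend dead at the threshold:

```c
/* Hypothetical per-backend failure tracker for option 4. */
typedef struct {
    int consec_failures;  /* consecutive connect_nb failures */
    int alive;            /* 1 = in the pool, 0 = taken offline */
} BE_STATE;

#define FAIL_THRESHOLD 5  /* assumed value, per the example above */

/* Called after each connect_nb attempt from thr_http. A single
   transient failure no longer kills the backend; only a run of
   FAIL_THRESHOLD failures in a row does. */
void record_connect_result(BE_STATE *st, int connect_ok)
{
    if (connect_ok) {
        st->consec_failures = 0;   /* any success clears the streak */
        return;
    }
    if (++st->consec_failures >= FAIL_THRESHOLD)
        st->alive = 0;             /* let do_resurect bring it back later */
}
```

A transient network hiccup then costs one failed request, not a dead
backend, while a backend whose HTTP service is truly down still gets
taken out of the pool after a few consecutive failures.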
Maybe there is a simpler and more elegant solution for this type of
condition, but I believe it needs to be handled differently than it is
right now.
In summary, we'd like to see:
1. A separate ConnectTimeOut variable to be used for connects. TimeOut
would be used for reads/gets, and also for connects if ConnectTimeOut is
not defined.
2. Don't automatically kill a backend, inside thr_http, if connect_nb fails.
Albert
--
To unsubscribe send an email with subject unsubscribe to [email protected].
Please contact [email protected] for questions.