Did you get any answers regarding this? I have pretty much the same problem with TimeOut. It would be great to have separate timeouts for connect, first byte, and last byte (perhaps even DNS, if people use this).
On Fri, Jun 19, 2009 at 7:49 PM, Albert <[email protected]> wrote:
> Over the past few months we've noticed that the killing and resurrection
> of backends were not done on time. We also noticed that some backends
> were being killed even though they were alive: because of some
> network-related hiccups (we're still investigating them), pound would
> kill the backend.
>
> I've spent the last couple of days debugging pound code. I found that
> there were not really any problems with pound code, but between our
> configuration and pound code, we were running into these problems.
>
> First, let me quickly describe our configuration for relevant variables:
> TimeOut 180 -- very high, but we have some HTTP requests which can take
> a while to complete
> Alive 15
>
> Some of the backends override TimeOut with 15 sec, but in general they
> should all be at 180 sec. We also have an HAPort for each of our
> backends (other than the regular HTTP port). We have a custom app which
> disables HAPort for a server when we need to take the backend offline,
> reset the HTTP service, etc. Otherwise, HAPort is always listening for
> pound, for Host-Alive checks.
>
> We've run into a problem where one of our backends died, and stayed that
> way for a while. This caused pound to run its resurrection code
> (do_resurect) every 3 minutes (our default TimeOut value). We tracked it
> down to the part of the code where pound is trying to connect to the
> server in do_resurect(), and waits for 3 minutes before timing out. As
> it waits, and since there is only 1 thread running do_resurect, the rest
> of the servers are not being checked every 15 seconds, as intended by
> the "Alive" value.
>
> The problem, as I see it, is the lack of a separate variable for
> "Connect TimeOut" vs "Time-Out for read/gets". Currently, pound uses the
> same variable for both connecting and waiting on read/gets. The "Connect
> TimeOut" can be an optional variable, with the default value of the
> regular TimeOut.
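
To make the proposed fallback concrete, here is a minimal sketch in C. It is not pound code: the `backend_cfg` struct, its field names, and both functions are hypothetical, and ConnectTimeOut is the directive proposed above, not an existing one. It also shows, under the simplifying assumption that backends are checked one after another and every dead one uses up the full connect timeout, roughly how long the single checker thread stays blocked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-backend settings; pound's real structures differ. */
struct backend_cfg {
    int timeout;          /* TimeOut: seconds to wait on read/gets */
    int connect_timeout;  /* proposed ConnectTimeOut; 0 means "not set" */
};

/* Timeout to apply to connect(): ConnectTimeOut if set, else TimeOut. */
int effective_connect_timeout(const struct backend_cfg *cfg)
{
    assert(cfg != NULL && cfg->timeout > 0);
    return cfg->connect_timeout > 0 ? cfg->connect_timeout : cfg->timeout;
}

/* Rough estimate of how long one sequential sweep is blocked when
 * n_dead backends never answer the connect (simplifying assumption:
 * each dead backend costs the full connect timeout). */
long sweep_block_seconds(const struct backend_cfg *cfg, int n_dead)
{
    return (long)n_dead * effective_connect_timeout(cfg);
}
```

With TimeOut 180 and no ConnectTimeOut, one dead backend blocks the sweep for about 180 seconds, against an intended Alive interval of 15. With a ConnectTimeOut of, say, 5, the same sweep is blocked for about 5 seconds while reads still get the full 180.
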
>
> We also have a related issue with the way pound kills backends when
> connect_nb to the backend's regular "Port" fails during an HTTP request.
> As I mentioned above, we've seen network hiccups where connect calls
> time out even though the backend is fine and another connect at the
> same time goes through. This has caused pound to kill the backend during
> an HTTP request if connect fails (and this happens 3 minutes after the
> initial call to connect_nb, during which time a bunch of other requests
> have been completed). I was wondering, in the case where an HAPort
> exists, should pound kill a backend if HAPort says it's alive?
> I believe in such a setup (where HAPort is defined), when connect_nb
> inside thr_http fails, pound should either:
> 1. Do nothing with the backend (let do_resurect take the backend offline
> if it's dead), and get the next backend from the list of available
> servers, or
> 2. Check HAPort to see if the backend is alive, and take appropriate
> action, or
> 3. Retry the connect_nb, and if it fails again, take the backend
> offline, or
> 4. Track the failure, and if it reaches some threshold value (e.g. 5
> consecutive failures), then take the backend offline.
>
> The last one is a bit complicated, but would make sure the backend is
> eventually taken out of the pool if HAPort is still responding but the
> HTTP service is not. On the other hand, if HAPort exists, then it's
> really the responsibility of the application running HAPort to do such
> checks, and to refuse connections on the HAPort if the HTTP service is
> dead (so one of the first 2 options would make more sense).
>
> Maybe there is a simpler and more elegant solution for this type of
> condition, but I believe it needs to be handled differently than it is
> right now.
> In summary, we'd like to see:
> 1. A separate ConnectTimeOut variable to be used on connects. TimeOut
> would be used for read/gets, and also for connects if ConnectTimeOut is
> not defined.
> 2. Don't automatically kill a backend, inside thr_http, if connect_nb
> fails.
>
> Albert
>
>
> --
> To unsubscribe send an email with subject unsubscribe to [email protected].
> Please contact [email protected] for questions.
>

--
Mattias Berge
Direct +46 (0)40-690 3825
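
For what it's worth, option 4 from Albert's list (take the backend offline only after a run of consecutive connect failures) could be sketched roughly as below. This is only an illustration: `backend_health` and `record_connect_result` are hypothetical names, not pound's actual structures or functions, and the threshold of 5 is just the example figure from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-backend health state; not pound's BACKEND struct. */
struct backend_health {
    int  consecutive_failures;  /* failed connects since the last success */
    bool alive;                 /* false once taken out of the pool */
};

/* Record the outcome of one connect attempt made during an HTTP request.
 * Returns true only when this call takes the backend offline.
 * A success resets the counter; bringing a dead backend back is left
 * to the resurrection check, so `alive` is never set to true here. */
bool record_connect_result(struct backend_health *be, bool connected,
                           int threshold)
{
    assert(be != NULL && threshold > 0);
    if (connected) {
        be->consecutive_failures = 0;
        return false;
    }
    if (!be->alive)
        return false;               /* already offline, nothing to do */
    be->consecutive_failures++;
    if (be->consecutive_failures >= threshold) {
        be->alive = false;
        return true;
    }
    return false;
}
```

A single timed-out connect caused by a network hiccup then only bumps the counter, and the next successful connect clears it; the backend leaves the pool only when failures pile up without a success in between.
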
