Over the past few months we've noticed that the killing and resurrection
of backends was not happening on time. We also noticed that some
backends were being killed even though they were alive: because of some
network-related hiccups (which we're still investigating), pound would
kill a healthy backend.
I've spent the last couple of days debugging the pound code. I found
that the problem is not really in pound itself, but in how our
configuration interacts with the pound code.
First, let me quickly describe our configuration for relevant variables:
TimeOut 180 -- very high, but some of our HTTP requests can take a while
to complete
Alive 15
Some of the backends have a TimeOut override of 15 sec, but in general
they should all be at 180 sec. We also have an HAPort for each of our
backends (separate from the regular HTTP port). We have a custom app
which disables the HAPort for a server when we need to take a backend
offline, reset the HTTP service, etc. Otherwise, the HAPort is always
listening for pound's Host-Alive checks.
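For reference, a minimal sketch of the relevant parts of our
configuration (the address and ports below are placeholders, not our
real values):

TimeOut 180
Alive   15

Service
    BackEnd
        Address 192.0.2.10
        Port    80
        HAport  8000    # our custom app closes this to take the backend offline
        # TimeOut 15    # per-backend override, used on a few backends
    End
End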
We've run into a problem where one of our backends died and stayed down
for a while. This caused pound to run its resurrection code
(do_resurect) only every 3 minutes (our default TimeOut value). We
tracked it down to the point in do_resurect() where pound tries to
connect to the dead server and waits the full 3 minutes before timing
out. While it waits, and since only one thread runs do_resurect, the
rest of the servers are not checked every 15 seconds, as intended by the
"Alive" value.
The problem, as I see it, is the lack of a separate variable for the
connect timeout vs. the timeout for reads/gets. Currently, pound uses
the same TimeOut value both when connecting and when waiting on
reads/gets. A "ConnectTimeOut" could be an optional variable that
defaults to the regular TimeOut.
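To illustrate, here is a rough sketch of what we have in mind. The
names (BACKEND_SKETCH, connect_timeout, connect_nb_sketch) are made up
for this example and are not pound's actual structures; the point is
just that connect() would get its own, usually much shorter, timeout,
falling back to TimeOut when ConnectTimeOut is unset:

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-backend fields: to is the existing TimeOut,
   conn_to is the proposed optional ConnectTimeOut (0 = unset). */
typedef struct {
    int to;       /* read/gets timeout (TimeOut), seconds */
    int conn_to;  /* connect timeout (ConnectTimeOut), seconds, 0 = unset */
} BACKEND_SKETCH;

/* Pick the timeout to use for connect(): fall back to TimeOut
   when ConnectTimeOut was not configured. */
int connect_timeout(const BACKEND_SKETCH *be)
{
    return be->conn_to > 0 ? be->conn_to : be->to;
}

/* Sketch of a non-blocking connect honoring a dedicated connect
   timeout (in seconds). Returns 0 on success, -1 on error/timeout. */
int connect_nb_sketch(int sock, const struct sockaddr *addr,
                      socklen_t alen, int conn_to)
{
    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    if (connect(sock, addr, alen) == 0)
        return 0;                        /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;
    struct pollfd pfd = { .fd = sock, .events = POLLOUT };
    if (poll(&pfd, 1, conn_to * 1000) != 1)
        return -1;                       /* timed out (or poll error) */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return -1;                       /* connect failed */
    return 0;
}
```

With ConnectTimeOut set to a few seconds, a dead backend would hold the
do_resurect thread for that long instead of the full 180 seconds, so the
other backends would still get their 15-second Alive checks.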
We also have a related issue with the way pound kills backends when
connect_nb to the regular "Port" of a backend fails during an HTTP
request. As I mentioned above, we've seen network hiccups where one
connect call times out even though the backend is fine and another
connect at the same time goes through. This has caused pound to kill the
backend during an HTTP request when the connect fails (and this happens
3 minutes after the initial call to connect_nb, during which time a
bunch of other requests have completed). I was wondering: in the case
where an HAPort exists, should pound kill a backend at all if the HAPort
says it is alive?
I believe in such setup (where HAPort is defined), when connect_nb
inside thr_http fails, pound should either:
1. Do nothing with the backend (let do_resurect take the backend offline
if it's dead), and get the next backend from the list of available
servers, or
2. Check HAPort to see if the backend is alive, and take appropriate
action, or
3. Retry the connect_nb, and if fails again, take the backend offline, or
4. Track the failure, if reached some threshold value (i.e. 5
consecutive failures), then take the backend offline.
The last one is a bit more complicated, but would make sure the backend
is eventually taken out of the pool if the HAPort is still responding
but the HTTP service is not. On the other hand, if an HAPort exists,
then it's really the responsibility of the application running the
HAPort to do such checks, and to refuse connections on the HAPort if the
HTTP service is dead (so one of the first two options would make more
sense).
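For completeness, option 4 could look something like this. The struct,
the function name, and the threshold of 5 are all hypothetical, just to
show the bookkeeping: count consecutive connect failures per backend,
reset on any success, and only mark the backend dead at the threshold:

```c
/* Hypothetical per-backend failure tracker for option 4. */
typedef struct {
    int consec_failures;  /* consecutive connect_nb failures */
    int alive;            /* 1 = in the pool, 0 = taken offline */
} BE_STATE;

#define FAIL_THRESHOLD 5  /* assumed value, per the example above */

/* Called after each connect_nb attempt from thr_http. A single
   transient failure no longer kills the backend; only a run of
   FAIL_THRESHOLD failures in a row does. */
void record_connect_result(BE_STATE *st, int connect_ok)
{
    if (connect_ok) {
        st->consec_failures = 0;   /* any success clears the streak */
        return;
    }
    if (++st->consec_failures >= FAIL_THRESHOLD)
        st->alive = 0;             /* let do_resurect bring it back later */
}
```

A transient network hiccup then costs one failed request, not a dead
backend, while a backend whose HTTP service is truly down still gets
taken out of the pool after a few consecutive failures.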
Maybe there is a simpler and more elegant solution for this type of
condition, but I believe it needs to be handled differently than it is
right now.
In summary, we'd like to see:
1. A separate ConnectTimeOut variable to be used for connects. TimeOut
would be used for reads/gets, and also for connects if ConnectTimeOut is
not defined.
2. Don't automatically kill a backend, inside thr_http, if connect_nb fails.
Albert
--
To unsubscribe send an email with subject unsubscribe to [email protected].
Please contact [email protected] for questions.