Hello,

On Fri, Oct 23, 2009 at 04:34:49PM -0400, Michael Kushnir wrote:
> Hello,
>
> I have the following configuration for my webfarm:
>
> 7 servers, all running Centos 5. Two (LB1 and LB2) are HAproxy load
> balancers, Web1-5 are Apache2 and Lighttpd web servers (I tried both
> as a method of elimination test on the side of the web servers). All
> of the servers are dual xeon Dell PE1850s with 4GB of ram, or better.
> I am using HAproxy 1.3.15.2. The webfarm is used for adserving using
> OpenX.
>
> Here is the problem I am having:
>
> Once I get in the neighborhood of 1000 sessions, the stats page on the
> active LB starts showing that the web servers are going up and down
> randomly (all servers, even those set as backup and not getting any
> traffic). When I monitor the servers in real time and check the
> stats page of the standby LB (the hot spare, which still monitors the
> servers), everything looks fine and all servers are running green with
> no problems. Because HAproxy thinks the servers are going up and
> down, clients are getting errors.
>
> The Apache servers are set up in prefork mode with ServerLimit at
> 2000, MaxClients at 2000, and MaxRequestsPerChild at 4000.
Can you check that you have enough processes started when the problem
happens? I'm asking because Apache takes a very long time to start
additional processes. As a workaround, you can set StartServers to your
desired value (and I believe MinSpareServers too, though I'm not
certain).

> The server load per server has never gotten to 2000, and there is
> plenty of CPU and RAM to spare on each machine even during heavy
> load.
>
> I am guessing that the issue has to do with the active HAproxy server
> running out of something or other and losing the ability to poll the
> web servers under this load. There is barely any usage of CPU or RAM
> on the LB itself, so I don't think it's a hardware issue. Below is my
> HAproxy config file. Please note that because this is an ad server
> distribution system, I don't need the proxy to keep track of sessions
> or send the same user to the same web server, as each request is a
> fully self-contained operation.

It is very possible that you're running out of file descriptors on your
haproxy process. Also, do you have ip_conntrack / nf_conntrack loaded
on there? Maybe a wrong setting is limiting the number of concurrent
connections? "dmesg" should tell you in this case.

> global
>         log 127.0.0.1 local0 notice
>         #log 127.0.0.1 local1 notice
>         #log loghost local0 info
>         maxconn 20000
>         #debug
>         #quiet
>         ulimit-n 25000
>         user root    <---- set because lit says ulimit-n above requires
>         group root   <---- set because lit says ulimit-n above requires

This is not required. You need to *start* the process as root, but
haproxy sets the ulimit before dropping privileges, so you can safely
use another user/group setting here.

(...)
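To illustrate, a global section along those lines might look like the
sketch below (the "haproxy" user and group names are assumptions; any
unprivileged account works, as long as the process itself is started as
root so that the ulimit can be raised first):

```
global
    log 127.0.0.1 local0 notice
    maxconn  20000
    ulimit-n 25000
    # Start the binary as root; haproxy raises the file descriptor
    # limit first, then drops privileges to the account below.
    user  haproxy
    group haproxy
```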
> listen webfarm
>         bind :80
>         mode http
>         stats enable
>         stats auth ***:***
>         balance roundrobin
>         #cookie SERVERID insert indirect nocache
>         option httpclose
>         option forwardfor
>         option httpchk GET /check.txt HTTP/1.0
>         #server web1.****.com ***.***.***.***:80 weight 1 check maxconn 200 inter 1000 rise 2 fall 5    <--- commented out, reserved for admin interface
>         server web2.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5 backup
>         server web3.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5
>         server web4.****.com ***.***.***.***:80 weight 3 check maxconn 3000 inter 1000 rise 2 fall 5
>
> and so forth, same for the rest of the servers...
>
> I have tried using HTTP/1.1 for the httpchk option, but that results
> in all servers being shown as down in the stats. I have also tried
> varying the inter variable from 500 to 4000 with no change in
> behavior. Please let me know if you can suggest something. I am
> guessing some operating system variables need to be tweaked.

You should try to disable "option httpchk" and see if it makes any
difference. If it does, it means that Apache is not responding to the
request (most likely not enough processes started). If it does not
change anything, it is very possible that you have trouble creating a
new outgoing connection for one of the reasons above. In that case,
you should not try to hide the issue with larger check intervals,
because it means your production traffic is affected by the issue too.

By the way, please check your logs for connection failures or response
timeouts. If you're interested, version 1.4-dev has a new feature which
tells you on the stats page where a check failed (L4, L7, ...). It can
help in circumstances like yours.

I'm thinking about something else: I suppose you have a second LB
serving as a backup. Could you check whether it sees failed checks too?
That will tell you whether the problem is on the proxy side or on the
server side.
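If you want to test the file descriptor and conntrack hypotheses
directly, something like the following could be run on the active LB
while the problem is occurring. This is only a sketch of standard Linux
diagnostics, not anything haproxy-specific:

```shell
# System-wide file handle usage: allocated, free, and maximum.
cat /proc/sys/fs/file-nr

# Is connection tracking loaded?  No output means it is not.
lsmod | grep -i conntrack || true

# Look for kernel complaints about full conntrack tables or
# resource exhaustion (may need root on some systems).
dmesg | grep -i -e conntrack -e "too many" || true
```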
Regards,
Willy

