Hi,

On Wed, Aug 12, 2009 at 12:50:00PM -0700, James Hartshorn wrote:
> Hi,
>
> We run Haproxy on Amazon ec2 for http load balancing. On Monday
> (august 11) we upgraded seven of our load balancers in two of our
> products to 1.3.20 from 1.3.15.8 (four servers, all of one product)
> and 1.3.18 (three servers, all of the other product). We kept the
> config files the same. We finished replacing the load balancers by
> 2300 UTC on aug 11, and at about 0900 UTC Aug 12 the first cluster
> (the one upgraded from 1.3.15.8) started showing performance issues,
> enough to cause our monitoring systems to go off. Response times were
> several seconds.

Please enable the stats page; it will show you a lot of useful information
in such cases, most notably the session rate and the number of concurrent
sessions.

> Logging on to one of the load balancers I saw normal
> cpu and memory, but looking at netstat -anp I saw more than 30k lines
> there, the majority in TIME_WAIT state.

TIME_WAIT is completely normal. Assuming your system is running with
default settings (60s timeout on finwait/timewait), 30k TIME_WAIT sessions
means you're getting about 500 connections per second (30000 / 60s).

> For background, the load
> balancers each point to the same pool of about 60 servers, which at
> the time were doing about 20-30 sessions per server, and the servers
> reporting about 80 requests per second (nominally 60% of peak).

Is that 80 req/s in total or per server? It seems extremely low for a
total, but if it's per server, it means about 4800 req/s in total
(60 servers x 80 req/s), which is approximately in the range of what we
have observed on another site running at EC2, the limit certainly being
caused by virtualization and/or shared CPU resources.

> At
> this point we put the old load balancers back into production and
> found them to be still working fine.

That's what I find strange, then :-/

> At around 1200 UTC Aug 12 a
> nearly identical state occured on the other set of load balancers (the
> ones upgraded from 1.3.18).
>
> If anyone can see any issues please let me know.
>
> I have pasted a representative haproxy.cfg file below:
> global
>     #log 127.0.0.1 local0 info
>     #log 127.0.0.1 local1 notice
>     #log loghost local0 info
>     maxconn 75000
>     chroot /var/lib/haproxy
>     user haproxy
>     group haproxy
>     daemon
>     #debug
>     #quiet
>
> defaults
>     #log global
>     mode http
>     #option httplog
>     option dontlognull
>     option redispatch
>     retries 3
>     maxconn 75000

Your defaults and frontend maxconn values should be slightly lower than
the global one, so that a single frontend can never fill the whole
process. BTW, 75000 seems a bit optimistic for a virtualized environment...

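As a rough sketch of the maxconn advice above: the fragment below keeps the
defaults and frontend limits a little below the global one. The numbers are
purely illustrative assumptions and not values recommended in this thread;
the section names are taken from the quoted configuration.

```
global
    # process-wide ceiling (illustrative value, an assumption)
    maxconn 20000

defaults
    mode http
    # slightly lower than the global limit, so that one frontend
    # can never consume every connection slot of the process
    maxconn 19000

frontend openx *:80
    maxconn 19000
    default_backend openx_ec2_hosted_http
```

The margin between the two limits is a judgment call; the only point is that
the per-frontend value stays strictly below the global one.
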
>     contimeout 5000
>     clitimeout 50000
>     srvtimeout 2000

Is there a reason for 50s on the client and only 2s on the server? I
suppose that when your servers slow down, you're killing a lot of requests
by sending 504 responses.

> frontend openx *:80
>     #log global
>     maxconn 75000
>     option forwardfor
>     default_backend openx_ec2_hosted_http
>
> backend openx_ec2_hosted_http
>     mode http
>     #balance roundrobin
>     balance leastconn
>     option abortonclose
>     option httpclose
>     #remove the line below if not 1.3.20
>     #option httpchk HEAD /health.chk

Why is there a special case for this line and 1.3.20? Are you sure you
don't change it when you switch to another version? If you do, it may be
the reason why your servers are flapping.

>     timeout queue 500

Same here, 500ms for a queue seems very short (though it looks consistent
with the 2s for the server).

>     #option forceclose

In case you had enabled it, avoid using forceclose, as you may reach a
point where the system refuses to allocate a source port for haproxy to
connect to the server.

(...)

Other than the points above, I don't see anything really wrong. Please do
enable the stats and save a report. Check the "Dwn" and "Chk" columns for
your servers. You might notice they're flapping because they take too long
to respond to health checks.

Regards,
Willy
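
For reference, a minimal sketch of one way to enable the stats page
requested above. The section name "stats" and port 8080 are assumptions
chosen for illustration and do not come from the original configuration;
timeouts are assumed to be inherited from the defaults section.

```
# dedicated listener for the statistics page
# (section name and port are hypothetical)
listen stats *:8080
    mode http
    stats enable
    # /haproxy?stats is also the default URI when none is given
    stats uri /haproxy?stats
```

With something like this in place, the per-server "Dwn" and "Chk" columns
mentioned above can be read from the page served on that port. On a public
instance the port would normally be restricted, for example with a firewall
rule or a "stats auth" line.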

