Hi,

On Wed, Aug 12, 2009 at 12:50:00PM -0700, James Hartshorn wrote:
> Hi,
> 
> We run Haproxy on Amazon ec2 for http load balancing.  On Monday
> (august 11) we upgraded seven of our load balancers in two of our
> products to 1.3.20 from 1.3.15.8 (four servers, all of one product)
> and 1.3.18 (three servers, all of the other product).  We kept the
> config files the same.  We finished replacing the load balancers by
> 2300 UTC on aug 11, and at about 0900 UTC Aug 12 the first cluster
> (the one upgraded from 1.3.15.8) started showing performance issues,
> enough to cause our monitoring systems to go off.  Response times were
> several seconds.

Please enable the stats page; it will show you a lot of useful
information in such cases, most notably the session rate and the
number of concurrent sessions.
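
For reference, a minimal sketch of a stats section. The section name,
port, URI and credentials are placeholders I made up for illustration,
not taken from your config, so adapt them and restrict access as needed:

```haproxy
# Hypothetical dedicated listener for the statistics page.
# Port 8080 and the credentials below are assumptions; change them.
listen stats 0.0.0.0:8080
    mode http
    stats enable
    stats uri /haproxy?stats
    stats realm Haproxy\ Statistics
    stats auth admin:changeme
```

With something like this in place, the page should be reachable at
/haproxy?stats on the chosen port.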

>  Logging on to one of the load balancers I saw normal
> cpu and memory, but looking at netstat -anp I saw more than 30k lines
> there, the majority in TIME_WAIT state.

TIME_WAIT is completely normal. Assuming your system is running with
default settings (60s timeout on finwait/timewait), 30k TIME_WAIT
sessions means you're getting about 500 connections per second
(30000 / 60).

>  For background, the load
> balancers each point to the same pool of about 60 servers, which at
> the time were doing about 20-30 sessions per server, and the servers
> reporting about 80 requests per second (nominally 60% of peak).

Is that 80 req/s in total or per server? It seems extremely low for
a total, but if it's per server, it means about 4800 req/s in total
(60 servers x 80 req/s), which is roughly in the range of what we
have observed on another site running on EC2, the limit most likely
being caused by virtualization and/or shared CPU resources.

>  At
> this point we put the old load balancers back into production and
> found them to be still working fine.

That's what I find strange, then :-/

>  At around 1200 UTC Aug 12 a
> nearly identical state occurred on the other set of load balancers (the
> ones upgraded from 1.3.18).
> 
> If anyone can see any issues please let me know.
> 
> I have pasted a representative haproxy.cfg file below:

> global
> #log 127.0.0.1 local0 info
> #log 127.0.0.1 local1 notice
> #log loghost local0 info
> maxconn 75000
> chroot /var/lib/haproxy
> user haproxy
> group haproxy
> daemon
> #debug
> #quiet
> 
> defaults
> #log global
> mode http
> #option httplog
> option dontlognull
>     option  redispatch
> retries 3
> maxconn 75000

Your defaults and frontend maxconn values should be slightly lower
than the global one, so that a single frontend can never fill the
whole process. BTW, 75000 seems a bit optimistic for a virtualized
environment...
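
As a rough sketch of what I mean, showing only the maxconn lines. The
74000 figure is purely illustrative, not a tuned recommendation, and
the global value itself probably deserves to be lowered too:

```haproxy
global
    maxconn 75000      # process-wide limit

defaults
    maxconn 74000      # illustrative value, slightly below global

frontend openx *:80
    maxconn 74000      # keeps one frontend from filling the process
```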

> contimeout 5000
> clitimeout 50000
> srvtimeout 2000

Is there a reason for 50s on the client and only 2s on the server?
I suppose that when your servers slow down, you're killing a lot
of requests by sending 504 responses.

> frontend openx *:80
> #log global
> maxconn 75000
>        option forwardfor
>        default_backend openx_ec2_hosted_http
> 
> backend openx_ec2_hosted_http
>        mode http
>        #balance roundrobin
>        balance leastconn
>        option abortonclose
>        option httpclose
>        #remove the line below if not 1.3.20
>        #option httpchk HEAD /health.chk

Why is there a special case for this line and 1.3.20? Are you
sure you don't change it when you switch to another version?
If you do, it may be the reason why your servers are flapping.

>        timeout queue 500

Same here, 500ms for a queue seems very short (although it looks
consistent with the 2s for the server).

>        #option forceclose

In case you had it enabled: avoid using forceclose, as you may
reach a point where the system refuses to allocate a source port
for haproxy to connect to the server.

(...)

Other than the points above, I don't see anything really wrong.
Please do enable the stats and save a report. Check the "Dwn"
and "Chk" columns for your servers. You might notice they're
flapping because they take too long to respond to health
checks.

Regards,
Willy

