Hi Matt,

On Wed, Jul 28, 2010 at 12:29:15PM -0600, Matt Banks wrote:
> OK, this is somewhat funny, but I'm mostly done with this email and a VERY 
> similar sounding problem was just asked a few minutes ago...
> 
> All,
> 
> Long story short(ish):
> 
> We put haproxy in front of a few servers that generate dynamic pages from a 
> database.  Here's a crude description of the setup:
> 
> HAProxy -> 2 to 10 Apache servers -> Gateway (connection to db) -> Local 
> caching database server ---(LAN or WAN)-> Database
> 
> The point is that if the page is cached, the local caching db server will 
> reply very fast.  If not, it may take a few seconds to respond.

Those are precisely multiples of 3 seconds, I guess?

> We've also found that we basically HAVE to use keep alive (eg loading an 
> image takes well under a second to load without HAProxy and perhaps .5 to 1.5 
> seconds with keepalive on whereas with keepalive off, the same image on the 
> same page takes 12-18 seconds) if that makes a difference.

Yes, with keep-alive you have one session; without it, you have many.
Losing a SYN or a SYN/ACK when establishing a connection implies a
3 second retransmit delay. So with keep-alive disabled, each object
comes in a separate session, causing more connection establishments
and thus amplifying the retransmission delay.
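
As a rough back-of-the-envelope check (the object count and loss rate
below are made-up figures, not measurements from your setup), the extra
delay grows with the number of new connections:

```shell
# Sketch: with keep-alive off, every object opens a new connection, and
# each connection whose SYN (or SYN/ACK) is lost stalls for the 3 s
# initial retransmit timer. Both numbers below are hypothetical.
objects=10      # objects on the page
loss=0.05       # assumed per-packet loss rate
awk -v n="$objects" -v p="$loss" \
    'BEGIN { printf "expected extra delay: %.1f s\n", n * p * 3 }'
# -> expected extra delay: 1.5 s
```

With keep-alive on, only one connection establishment is at risk instead
of ten, which matches the 12-18 s vs. sub-second difference you observed.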

> Here's where things get a bit... tricky?
> 
> We have httpcheck disabled.  This is essentially because it's not working for 
> us - at least how we'd like it to be.  In a nutshell, we're getting a LOT of 
> false positives where a server is listed as "up going down" or down when in 
> reality, a non-cached page was simply taking a couple seconds (probably 3-5 
> but definitely less than 10) to load.

This is also typical of high packet loss rate.
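
If you want to put a number on it, a long ping run between the haproxy
machine and a backend gives a quick estimate. The summary line below is
a hard-coded sample so the parsing is visible; in practice you would
pipe in `ping -c 100 server1` (server1 being one of your Apache boxes):

```shell
# Pull the loss percentage out of ping's summary line.
loss_pct() { awk -F, '/packet loss/ { sub(/%.*/, "", $3); print $3+0 }'; }

echo "100 packets transmitted, 93 received, 7% packet loss, time 9910ms" \
    | loss_pct
# -> 7
```

Anything above 0% on a LAN is already enough to trip 3 s retransmits
regularly and make checks time out.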

> The point is, we get several 503 errors throughout the day.  And they appear 
> to be random.  Apache never goes down nor reports an error.  Frankly, I think 
> what's happening is that haproxy is hitting a server which takes too long to 
> respond, so it tries another server (which also doesn't have the page cached) 
> and goes through the list until it gives up and reports a 503.

In my opinion, what is happening is that something is causing connections
to fail between haproxy and the servers (since the health checks fail too).
There are two common causes for this :

  - a network card connected to a forced 100-full switch. Almost all
    gigabit cards will negotiate 100-half if the switch does not advertise
    anything, causing a huge packet loss rate. You can easily check on
    your server using ethtool:

       ethtool eth0

  - a mis-configured netfilter which remains enabled on the haproxy
    machine (the default settings of the conntrack table are too small
    to support a moderate load). You can see messages like "conntrack
    table is full" in "dmesg". Just in case, you should completely
    unload the nf_conntrack / ip_conntrack modules from the machine.
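
To make both checks concrete (the interface name and the sample ethtool
output below are assumptions; substitute your own):

```shell
# Flag a bad duplex setting in ethtool output. The printf feeds a sample
# capture so the check is visible; in practice run `ethtool eth0 | check_link`.
check_link() {
  grep -q 'Duplex: Full' || echo "WARNING: link is not full duplex"
}
printf 'Speed: 100Mb/s\nDuplex: Half\n' | check_link

# For the conntrack side, look for drops and still-loaded modules:
#   dmesg | grep -i 'conntrack.*full'
#   lsmod | grep -E 'nf_conntrack|ip_conntrack'
```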

You could also try to run an FTP test from/to the haproxy machine. You
should easily be able to saturate the port when transferring large files
(approx 11800 kB/s on 100 Mbps, 118 MB/s on 1 Gbps). Any significantly
lower value indicates a communication problem. This will show you where
the network runs well and where it runs poorly. Sometimes it is as
simple as a broken NIC, cable or switch port (the latter has happened
to me several times).
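
For reference, those saturation figures are just the line rate divided
by 8, minus framing overhead; a small awk loop reproduces them (the
0.944 efficiency factor is an approximation for Ethernet/IP/TCP
overhead, derived from the 11.8 MB/s figure above):

```shell
# Approximate achievable payload rate at a given link speed:
# bit rate / 8, scaled by ~0.944 for framing overhead (approximation).
for mbps in 100 1000; do
  awk -v r="$mbps" \
      'BEGIN { printf "%4d Mbps -> ~%.1f MB/s payload\n", r, r / 8 * 0.944 }'
done
# ->  100 Mbps -> ~11.8 MB/s payload
# -> 1000 Mbps -> ~118.0 MB/s payload
```

If your transfer test lands far below these numbers, blame the link,
not the application.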

> Meanwhile, if you go directly to the page on the Apache server, it loads 
> fine.  Or if you re-load using HAProxy, it works fine as well.
> 
> I'm just wondering where to start with this.  We have several sites 
> experiencing the same problem, but since we're using roughly the same setup 
> for each one, I'm not opposed to saying it could be how we have HAProxy set 
> up.

There is no particular reason your config would cause such things to
happen, and it definitely could not make the checks fail randomly.
That's why I'm suggesting environment issues, which are a recurring
concern.

Regards,
Willy
