504 errors

Alex Davies Tue, 13 Sep 2011 05:26:50 -0700

Hi,

I am not a haproxy expert, but have been using it in production for some
time with excellent results and I wonder if I can seek some expert advice on
running the fairly fast application server http://www.tornadoweb.org/ behind
HAproxy (haproxy-1.3.23 using the EPEL RPM (-1) on RHEL6 x86_64). Haproxy is
working very well for me, but i'm looking for help understanding how I can
diagnose problems with the Tornado application I have running behind it.


I have ~8 tornado processes running on two servers. Its important that one
is active and the other is failover (some state is stored in memory). The
parts of my haproxy configuration relevant to my question are below. I
notice a large number of entries in the logs like this:

502 errors:
Sep 13 12:42:45 localhost haproxy[15128]:
188.222.50.208:61001[13/Sep/2011:12:42:43.881] main
python_8001/python_8001_fe1 10/0/0/-1/1527
502 204 - - SH-- 6676/6676/2082/2082/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"
Sep 13 12:42:45 localhost haproxy[15128]:
81.246.46.162:29456[13/Sep/2011:12:42:14.289] main
python_8001/python_8001_fe1 28/0/0/-1/31118
502 204 - - SH-- 6675/6675/2081/2081/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"

504 errors:
Sep 13 12:43:08 localhost haproxy[15128]:
180.234.122.248:52888[13/Sep/2011:12:38:08.822] main
python_9004/python_9004_fe1 45/0/0/-1/300045
504 194 - - sH-- 6607/6607/697/697/0 0/0 "POST /xxx/chat/message/4/updates
HTTP/1.1"
Sep 13 12:43:09 localhost haproxy[15128]:
82.26.136.198:61758[13/Sep/2011:12:38:09.071] main
python_8001/python_8001_fe1 19/0/0/-1/300020
504 194 - - sH-- 6569/6569/2085/2085/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"

It seems to me that all of these involve 0 seconds waiting in a queue, 0
seconds to make a connection to the final app server and then a aborted
connection to the app server before a complete response could be received
The total time in milliseconds between accept and last close seems to be
~300 seconds for most of the requests (although far from all of them, as the
first entry shows). If I *restart* (not reload) haproxy, I still get lines
with the fifth of these numbers ("Tt" in the docs) as ~300,000 (the "timeout
server" value in the config at the moment I copied the logs above), a few
seconds after the haproxy process starts. I also get lots that seem to end
on almost exactly 300k even when I change both "timeout client" and "timeout
server" to very different numbers. It is possible that the application
(jQuery) has a 300s timeout hardcoded, but in any case I do not understand
why the haproxy logs show connections with a connection of 300k failing when
I stop haproxy, increase "timeout server" by a order of magnitude and start
it again.

Looking at the next part of the log entries it seems that bytes_read is
always 204 for 504 errors and 194 for 504 errors. This does seem to be a
fairly regular pattern:
[root@frontend2 log]# cat /var/log/haproxy.log | grep " 504 194" | wc -l
1975
[root@frontend2 log]# cat /var/log/haproxy.log | grep " 502 204" | wc -l
12401
[root@frontend2 log]# cat /var/log/haproxy.log | wc -l
18721

 My second question is how do I find out exactly what is being returned
(easily), i.e. what are those 194/204 bytes? This might give me a hint as to
what is going wrong or timing out on the application server. I guess I could
try to tcpdump but I might struggle to actually filter down the correct data
(there are large numbers of successful connections going on)

The next part of the logs are most interesting; ignoring the two cookie
fields we see that the "the server-side timeout expired" for the 504 errors
and "the TCP session was unexpectedly aborted by the server, or the server
explicitly refused it" in the case of the 502. Subject to my questions above
I have a theory that the 504s are caused by the long-pooling application,
but I do not understand why in the case of the 502 haproxy is not retrying
the TCP connection before returning a 502 - I thought that the "option
redispatch" and "retries 10" would ensure another go.

If anybody is able to shed some thoughts on my two questions I would be very
grateful!

Many thanks,

Alex

# haproxy.conf
global
    log         127.0.0.1 local4 debug

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     50000
    user        haproxy
    group       haproxy
    daemon

defaults
    mode              http
    option            httplog
    #option tcplog
    option            dontlognull
    option            dontlog-normal

    log               global
    retries           10
    maxconn           50000
    timeout connect   200000
    contimeout        200000
    clitimeout        90000
    option forwardfor except 127.0.0.1/32 # Apache running https on
localhost
    option httpclose # Required for REMOTE HEADER
    option redispatch
    timeout connect 10000
    timeout client  300000
    timeout server  300000

frontend  main *:80
    acl url_py_8001       path_beg       -i /url1
    acl url_py_8002       path_beg       -i /url2
    # ~10 others

    use_backend           python_8001    if url_py_8001
    use_backend           python_8002    if url_py_8002
    # ~10 others

    default_backend       webservers

# In all of these we use a host file to allow easy changes
backend python_8001
    server      python_8001_fe1 frontend1.xxx.com:8001 check
    server      python_8001_fe2 frontend2.xxx.com:8001 check backup

backend python_8002
    server      python_8002_fe1 frontend1.xxx.com:8002 check
    server      python_8002_fe2 frontend2.xxx.com:8002 check backup
(and so on, for 8001 to 8005, 9001 - 9005)

haproxy / Python 'tornado' framework - digging into 502/504 errors

Reply via email to