Hi,
I am not a haproxy expert, but have been using it in production for some
time with excellent results and I wonder if I can seek some expert advice on
running the fairly fast application server http://www.tornadoweb.org/ behind
HAproxy (haproxy-1.3.23 using the EPEL RPM (-1) on RHEL6 x86_64). Haproxy is
working very well for me, but i'm looking for help understanding how I can
diagnose problems with the Tornado application I have running behind it.
I have ~8 tornado processes running on two servers. Its important that one
is active and the other is failover (some state is stored in memory). The
parts of my haproxy configuration relevant to my question are below. I
notice a large number of entries in the logs like this:
502 errors:
Sep 13 12:42:45 localhost haproxy[15128]:
188.222.50.208:61001[13/Sep/2011:12:42:43.881] main
python_8001/python_8001_fe1 10/0/0/-1/1527
502 204 - - SH-- 6676/6676/2082/2082/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"
Sep 13 12:42:45 localhost haproxy[15128]:
81.246.46.162:29456[13/Sep/2011:12:42:14.289] main
python_8001/python_8001_fe1 28/0/0/-1/31118
502 204 - - SH-- 6675/6675/2081/2081/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"
504 errors:
Sep 13 12:43:08 localhost haproxy[15128]:
180.234.122.248:52888[13/Sep/2011:12:38:08.822] main
python_9004/python_9004_fe1 45/0/0/-1/300045
504 194 - - sH-- 6607/6607/697/697/0 0/0 "POST /xxx/chat/message/4/updates
HTTP/1.1"
Sep 13 12:43:09 localhost haproxy[15128]:
82.26.136.198:61758[13/Sep/2011:12:38:09.071] main
python_8001/python_8001_fe1 19/0/0/-1/300020
504 194 - - sH-- 6569/6569/2085/2085/0 0/0 "POST /xxx/chat/status/updates
HTTP/1.1"
It seems to me that all of these involve 0 seconds waiting in a queue, 0
seconds to make a connection to the final app server and then a aborted
connection to the app server before a complete response could be received
The total time in milliseconds between accept and last close seems to be
~300 seconds for most of the requests (although far from all of them, as the
first entry shows). If I *restart* (not reload) haproxy, I still get lines
with the fifth of these numbers ("Tt" in the docs) as ~300,000 (the "timeout
server" value in the config at the moment I copied the logs above), a few
seconds after the haproxy process starts. I also get lots that seem to end
on almost exactly 300k even when I change both "timeout client" and "timeout
server" to very different numbers. It is possible that the application
(jQuery) has a 300s timeout hardcoded, but in any case I do not understand
why the haproxy logs show connections with a connection of 300k failing when
I stop haproxy, increase "timeout server" by a order of magnitude and start
it again.
Looking at the next part of the log entries it seems that bytes_read is
always 204 for 504 errors and 194 for 504 errors. This does seem to be a
fairly regular pattern:
[root@frontend2 log]# cat /var/log/haproxy.log | grep " 504 194" | wc -l
1975
[root@frontend2 log]# cat /var/log/haproxy.log | grep " 502 204" | wc -l
12401
[root@frontend2 log]# cat /var/log/haproxy.log | wc -l
18721
My second question is how do I find out exactly what is being returned
(easily), i.e. what are those 194/204 bytes? This might give me a hint as to
what is going wrong or timing out on the application server. I guess I could
try to tcpdump but I might struggle to actually filter down the correct data
(there are large numbers of successful connections going on)
The next part of the logs are most interesting; ignoring the two cookie
fields we see that the "the server-side timeout expired" for the 504 errors
and "the TCP session was unexpectedly aborted by the server, or the server
explicitly refused it" in the case of the 502. Subject to my questions above
I have a theory that the 504s are caused by the long-pooling application,
but I do not understand why in the case of the 502 haproxy is not retrying
the TCP connection before returning a 502 - I thought that the "option
redispatch" and "retries 10" would ensure another go.
If anybody is able to shed some thoughts on my two questions I would be very
grateful!
Many thanks,
Alex
# haproxy.conf
global
log 127.0.0.1 local4 debug
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 50000
user haproxy
group haproxy
daemon
defaults
mode http
option httplog
#option tcplog
option dontlognull
option dontlog-normal
log global
retries 10
maxconn 50000
timeout connect 200000
contimeout 200000
clitimeout 90000
option forwardfor except 127.0.0.1/32 # Apache running https on
localhost
option httpclose # Required for REMOTE HEADER
option redispatch
timeout connect 10000
timeout client 300000
timeout server 300000
frontend main *:80
acl url_py_8001 path_beg -i /url1
acl url_py_8002 path_beg -i /url2
# ~10 others
use_backend python_8001 if url_py_8001
use_backend python_8002 if url_py_8002
# ~10 others
default_backend webservers
# In all of these we use a host file to allow easy changes
backend python_8001
server python_8001_fe1 frontend1.xxx.com:8001 check
server python_8001_fe2 frontend2.xxx.com:8001 check backup
backend python_8002
server python_8002_fe1 frontend1.xxx.com:8002 check
server python_8002_fe2 frontend2.xxx.com:8002 check backup
(and so on, for 8001 to 8005, 9001 - 9005)