Hi again Alex,

Le Mardi 13 Septembre 2011 13:26:04 Alex Davies a écrit :
> Hi,
> 
> I am not a haproxy expert, but have been using it in production for some
> time with excellent results and I wonder if I can seek some expert advice on
> running the fairly fast application server http://www.tornadoweb.org/
> behind HAproxy (haproxy-1.3.23 using the EPEL RPM (-1) on RHEL6 x86_64).
> Haproxy is working very well for me, but i'm looking for help understanding
> how I can diagnose problems with the Tornado application I have running
> behind it.
> 
> I have ~8 tornado processes running on two servers. Its important that one
> is active and the other is failover (some state is stored in memory). The
> parts of my haproxy configuration relevant to my question are below. I
> notice a large number of entries in the logs like this:
> 
> 502 errors:
> Sep 13 12:42:45 localhost haproxy[15128]:
> 188.222.50.208:61001[13/Sep/2011:12:42:43.881] main
> python_8001/python_8001_fe1 10/0/0/-1/1527
> 502 204 - - SH-- 6676/6676/2082/2082/0 0/0 "POST /xxx/chat/status/updates
> HTTP/1.1"
> Sep 13 12:42:45 localhost haproxy[15128]:
> 81.246.46.162:29456[13/Sep/2011:12:42:14.289] main
> python_8001/python_8001_fe1 28/0/0/-1/31118
> 502 204 - - SH-- 6675/6675/2081/2081/0 0/0 "POST /xxx/chat/status/updates
> HTTP/1.1"
> 
> 504 errors:
> Sep 13 12:43:08 localhost haproxy[15128]:
> 180.234.122.248:52888[13/Sep/2011:12:38:08.822] main
> python_9004/python_9004_fe1 45/0/0/-1/300045
> 504 194 - - sH-- 6607/6607/697/697/0 0/0 "POST /xxx/chat/message/4/updates
> HTTP/1.1"
> Sep 13 12:43:09 localhost haproxy[15128]:
> 82.26.136.198:61758[13/Sep/2011:12:38:09.071] main
> python_8001/python_8001_fe1 19/0/0/-1/300020
> 504 194 - - sH-- 6569/6569/2085/2085/0 0/0 "POST /xxx/chat/status/updates
> HTTP/1.1"
> 
> It seems to me that all of these involve 0 seconds waiting in a queue, 0
> seconds to make a connection to the final app server and then a aborted
> connection to the app server before a complete response could be received
> The total time in milliseconds between accept and last close seems to be
> ~300 seconds for most of the requests (although far from all of them, as the
> first entry shows).

I wonder if you've not reached a limit on your tornado servers, for example 
the max number of open files. Do the servers go down when it happens (due to 
the "check" keyword that only performs a layer 4 check in your configuration) 
?

> If I *restart* (not reload) haproxy, I still get lines
> with the fifth of these numbers ("Tt" in the docs) as ~300,000 (the
> "timeout server" value in the config at the moment I copied the logs
> above), a few seconds after the haproxy process starts.

What you describe makes me think of a syslog asynchronous configuration. That 
could explain why you see logs after the restart.
When it happens, can you verify that the pid logged is really the pid of the 
new instance or if it's the old one ?

In your example, your instance has the pid 15128.

> I also get lots
> that seem to end on almost exactly 300k even when I change both "timeout
> client" and "timeout server" to very different numbers. It is possible that
> the application (jQuery) has a 300s timeout hardcoded, but in any case I do
> not understand why the haproxy logs show connections with a connection of
> 300k failing when I stop haproxy, increase "timeout server" by a order of
> magnitude and start it again.
> 
> Looking at the next part of the log entries it seems that bytes_read is
> always 204 for 504 errors and 194 for 504 errors. This does seem to be a
> fairly regular pattern:
> [root@frontend2 log]# cat /var/log/haproxy.log | grep " 504 194" | wc -l
> 1975
> [root@frontend2 log]# cat /var/log/haproxy.log | grep " 502 204" | wc -l
> 12401
> [root@frontend2 log]# cat /var/log/haproxy.log | wc -l
> 18721
> 
>  My second question is how do I find out exactly what is being returned
> (easily), i.e. what are those 194/204 bytes? This might give me a hint as to
> what is going wrong or timing out on the application server.I guess I
> could try to tcpdump but I might struggle to actually filter down the
> correct data (there are large numbers of successful connections going on)

Those sizes correspond respectively to the 504 and 502 responses sent by 
haproxy.

        [HTTP_ERR_502] =
        "HTTP/1.0 502 Bad Gateway\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid 
or 
incomplete response.\n</body></html>\n",

        [HTTP_ERR_504] =
        "HTTP/1.0 504 Gateway Time-out\r\n"
        "Cache-Control: no-cache\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond 
in 
time.\n</body></html>\n",

(you can find them in src/proto_http.c)

> The next part of the logs are most interesting; ignoring the two cookie
> fields we see that the "the server-side timeout expired" for the 504 errors
> and "the TCP session was unexpectedly aborted by the server, or the server
> explicitly refused it" in the case of the 502. Subject to my questions above
> I have a theory that the 504s are caused by the long-pooling application,

oh well, I didn't understand you on my first reads. You mean "long-polling".
That can explain your 504 errors, depending on your long-polling connection 
timeout.

> but I do not understand why in the case of the 502 haproxy is not retrying
> the TCP connection before returning a 502 - I thought that the "option
> redispatch" and "retries 10" would ensure another go.

"option redispatch" works for server that are down. If the request has already 
been sent to a server, haproxy can't resent it to a new one.

-- 
Cyril Bonté

Reply via email to