On Fri, Jul 30, 2010 at 04:33:16PM -0400, Phil Dupont wrote:
> I guess that's my point... I don't know where else to check.  There's
> nothing between the HAProxy box and the webservers but a switch.  Certainly
> nothing I can get logs from.
> 
> If I drill into an example, I end up with nothing.  See below for the logs
> of one scenario I tried to trace.  But essentially:
> 
>    - I see an HAProxy 504 error for someone trying to load
>    http://scoutmob.com/new-york/howitworks
>    - Then I check the apache access logs (not the error logs because the
>    error logs don't contain the user's IP addess, just the proxy's IP address)
>    and see the connection was successful (passed a 200)

Unfortunately there aren't any timer in your apache logs. What you see is
that apache correctly received the request and correctly responded. But
you don't know how much time it took for that. If it takes more than 50s
you have configured on your haproxy as the server timeout, the response
will never get back to the client. From the doc below, it seems it's "%D"
that you need to add to CustomLog to add the timer in the logs :

    http://httpd.apache.org/docs/2.2/mod/mod_log_config.html

>    - Then I run request-log-analyzer against the ruby log and see that the
>    method that creates the /new-york/howitworks takes a max of .01s to load.

Most often when you see a timeout between two levels of proxies and a ruby
server, it's because it took ages to the ruby server to accept the request
(for whatever reasons, starting with mongrel's 1-connection limit). Thus,
while the second proxy waits for the connection to establish, the first one
waits for the second proxy to respond, as it does not know it can't even
connect. Sometimes even the second proxy can connect but its request remains
for ages in a queue, waiting for being dispatched. Your ruby logs just
indicate it took 10ms to process the request once it got there. There's a
similar problem with some Java-based application servers that make use of
thread pools with a front dispatched using a deep queue.

But if you look below, there's no doubt it really took 50 seconds to
timeout :

> Jul 29 20:21:47 localhost haproxy[17405]: 
> 68.174.74.20:55350[29/Jul/2010:20:20:57.478] http_proxy primary_farm/WS01 
> 4/0/1/-1/50006 504 194 - - sH-- 14/14/14/13/0 0/0 "GET /new-york/howitworks 
> HTTP/1.1"

Here, all we know is that apache received the request immediately after haproxy,
at the same second :

> 68.174.74.20 - - [29/Jul/2010:20:20:57 -0400] "GET /new-york/howitworks 
> HTTP/1.1" 200 0 "http://scoutmob.com/new-york/deal"; "Mozilla/5.0 (Macintosh; 
> U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML, like Gecko) 
> Chrome/5.0.375.99 Safari/533.4"

But we don't know when it got the response. Also, have you noticed how the
response was empty ? (status 200, 0 bytes). Apache itself has at least been
able to log that there was a problem with the communication with the ruby
server.

Also, below we see that the client probably has hit the reload button, because
the referrer is the page itself this time :

> 68.174.74.20 - - [29/Jul/2010:20:21:30 -0400] "GET /new-york/howitworks
> HTTP/1.1" 200 5698 "http://scoutmob.com/new-york/howitworks"; "Mozilla/5.0
> (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML, like
> Gecko) Chrome/5.0.375.99 Safari/533.4"

The best thing to do when you can't trust the tools anymore is to start
tcpdump between apache and ruby. Capture using full packets (-s0) and
put that in a file (-w /tmp/file.cap). Then you can run ngrep on it to
search for the faulty IP. Find the port and look for one very long
connection. Probably that you'll notice something odd such as the ruby
server returning 200 without any content-length but a transfer-encoding
chunked instead, then closing the connection (or dying) after 1 minute,
which would result in apache returning what it got (200, 0 bytes), long
after haproxy closed.

Maybe that's something different, but it's pure guess, and a common pattern.

Regards,
Willy


Reply via email to