Thanks Willy. When things quiet down this weekend I'll make some changes to capture the data in the logs and see what that turns up.
PhilD On Fri, Jul 30, 2010 at 5:07 PM, Willy Tarreau <[email protected]> wrote: > On Fri, Jul 30, 2010 at 04:33:16PM -0400, Phil Dupont wrote: > > I guess that's my point... I don't know where else to check. There's > > nothing between the HAProxy box and the webservers but a switch. > Certainly > > nothing I can get logs from. > > > > If I drill into an example, I end up with nothing. See below for the > logs > > of one scenario I tried to trace. But essentially: > > > > - I see an HAProxy 504 error for someone trying to load > > http://scoutmob.com/new-york/howitworks > > - Then I check the apache access logs (not the error logs because the > > error logs don't contain the user's IP addess, just the proxy's IP > address) > > and see the connection was successful (passed a 200) > > Unfortunately there aren't any timer in your apache logs. What you see is > that apache correctly received the request and correctly responded. But > you don't know how much time it took for that. If it takes more than 50s > you have configured on your haproxy as the server timeout, the response > will never get back to the client. From the doc below, it seems it's "%D" > that you need to add to CustomLog to add the timer in the logs : > > http://httpd.apache.org/docs/2.2/mod/mod_log_config.html > > > - Then I run request-log-analyzer against the ruby log and see that > the > > method that creates the /new-york/howitworks takes a max of .01s to > load. > > Most often when you see a timeout between two levels of proxies and a ruby > server, it's because it took ages to the ruby server to accept the request > (for whatever reasons, starting with mongrel's 1-connection limit). Thus, > while the second proxy waits for the connection to establish, the first one > waits for the second proxy to respond, as it does not know it can't even > connect. Sometimes even the second proxy can connect but its request > remains > for ages in a queue, waiting for being dispatched. Your ruby logs just > indicate it took 10ms to process the request once it got there. There's a > similar problem with some Java-based application servers that make use of > thread pools with a front dispatched using a deep queue. > > But if you look below, there's no doubt it really took 50 seconds to > timeout : > > > Jul 29 20:21:47 localhost haproxy[17405]: > > 68.174.74.20:55350[29/Jul/2010:20:20:57.478] > http_proxy primary_farm/WS01 4/0/1/-1/50006 504 194 - - sH-- 14/14/14/13/0 > 0/0 "GET /new-york/howitworks HTTP/1.1" > > Here, all we know is that apache received the request immediately after > haproxy, > at the same second : > > > 68.174.74.20 - - [29/Jul/2010:20:20:57 -0400] "GET /new-york/howitworks > HTTP/1.1" 200 0 "http://scoutmob.com/new-york/deal" "Mozilla/5.0 > (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML, like > Gecko) Chrome/5.0.375.99 Safari/533.4" > > But we don't know when it got the response. Also, have you noticed how the > response was empty ? (status 200, 0 bytes). Apache itself has at least been > able to log that there was a problem with the communication with the ruby > server. > > Also, below we see that the client probably has hit the reload button, > because > the referrer is the page itself this time : > > > 68.174.74.20 - - [29/Jul/2010:20:21:30 -0400] "GET /new-york/howitworks > > HTTP/1.1" 200 5698 "http://scoutmob.com/new-york/howitworks" > "Mozilla/5.0 > > (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML, > like > > Gecko) Chrome/5.0.375.99 Safari/533.4" > > The best thing to do when you can't trust the tools anymore is to start > tcpdump between apache and ruby. Capture using full packets (-s0) and > put that in a file (-w /tmp/file.cap). Then you can run ngrep on it to > search for the faulty IP. Find the port and look for one very long > connection. Probably that you'll notice something odd such as the ruby > server returning 200 without any content-length but a transfer-encoding > chunked instead, then closing the connection (or dying) after 1 minute, > which would result in apache returning what it got (200, 0 bytes), long > after haproxy closed. > > Maybe that's something different, but it's pure guess, and a common > pattern. > > Regards, > Willy > >

