Willy,

You rock.  Thanks for the help.

Good news is that it shows that apache is genuinely taking more then 50
seconds to respond (see my example below).  What's interesting is that when
I look at the ruby production log... I see about a convenient 50 second gap.

That said, it's clear that it's a communication issue between Apache and
Ruby.  Definitely not an issue with HAProxy.

I'll dig in and ferret out the issue, but I wanted to close the loop.

Thanks again,
PhilD


--- Log Entries ---

LB01 - HAProxy Error Log (/var/log/haproxy.err):


Jul 31 19:10:31 localhost haproxy[17404]: 72.148.201.189:50318
[31/Jul/2010:19:09:41.950]
http_proxy primary_farm/WS01 10/0/0/-1/50014 504 194 - - sH-- 0/0/0/0/0 0/0
"GET
/tidbits.json?API=5Tb44zB-OfBWAyIQYOIO&version=2.5&platform_id=631daac59543ed898fb00713ccc4523377af38ff&city_short_name=atlanta&state=GA
HTTP/1.1"



WS01 - Apache Log (/var/log/apache2/access.log):

72.148.201.189 - - [31/Jul/2010:19:09:41 -0400] "GET
/tidbits.json?API=5Tb44zB-OfBWAyIQYOIO&version=2.5&platform_id=631daac59543ed898fb00713ccc4523377af38ff&city_short_name=atlanta&state=GA
HTTP/1.1" 200 0 "-" "Appcelerator Titanium/1.3.3 (iPhone/4.0.1; iPhone OS;
en_US;)" *50042434 *



WS01-VanSkeevy - Request-Log-Analyzer - Ruby Production Log:

Jul 31 19:09:41 WS01 rails[23423]: Completed in 111ms (View: 74, DB: 29) |
200 OK [
http://api.scoutmob.com/unexpired/f2ad0a459da29b4d83c86e8134a2315ff2364a884a27e26dee74f374704c0335.json?API=5Tb44zB-OfBWAyIQYOIO&version=2.5&platform_id=9f1f94e32ff79c816549263979cc2fccc7c655da&city_short_name=atlanta&state=GA
]

Jul 31 19:10:31 WS01 rails[23423]: Processing DealsController#unexpired to
json (for 72.148.201.189 at 2010-07-31 19:10:31) [GET]


PhilD

<http://www.scoutmob.com/>*Phil Dupont*
p: 404-273-7079
e: [email protected]

In walking, just walk. In sitting, just sit. Above all, don't wobble.
-Yun-Men




On Fri, Jul 30, 2010 at 5:28 PM, Phil Dupont <[email protected]> wrote:

> Thanks Willy.  When things quiet down this weekend I'll make some changes
> to capture the data in the logs and see what that turns up.
>
> PhilD
>
>
> On Fri, Jul 30, 2010 at 5:07 PM, Willy Tarreau <[email protected]> wrote:
>
>> On Fri, Jul 30, 2010 at 04:33:16PM -0400, Phil Dupont wrote:
>> > I guess that's my point... I don't know where else to check.  There's
>> > nothing between the HAProxy box and the webservers but a switch.
>>  Certainly
>> > nothing I can get logs from.
>> >
>> > If I drill into an example, I end up with nothing.  See below for the
>> logs
>> > of one scenario I tried to trace.  But essentially:
>> >
>> >    - I see an HAProxy 504 error for someone trying to load
>> >    http://scoutmob.com/new-york/howitworks
>> >    - Then I check the apache access logs (not the error logs because the
>> >    error logs don't contain the user's IP addess, just the proxy's IP
>> address)
>> >    and see the connection was successful (passed a 200)
>>
>> Unfortunately there aren't any timer in your apache logs. What you see is
>> that apache correctly received the request and correctly responded. But
>> you don't know how much time it took for that. If it takes more than 50s
>> you have configured on your haproxy as the server timeout, the response
>> will never get back to the client. From the doc below, it seems it's "%D"
>> that you need to add to CustomLog to add the timer in the logs :
>>
>>    http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
>>
>> >    - Then I run request-log-analyzer against the ruby log and see that
>> the
>> >    method that creates the /new-york/howitworks takes a max of .01s to
>> load.
>>
>> Most often when you see a timeout between two levels of proxies and a ruby
>> server, it's because it took ages to the ruby server to accept the request
>> (for whatever reasons, starting with mongrel's 1-connection limit). Thus,
>> while the second proxy waits for the connection to establish, the first
>> one
>> waits for the second proxy to respond, as it does not know it can't even
>> connect. Sometimes even the second proxy can connect but its request
>> remains
>> for ages in a queue, waiting for being dispatched. Your ruby logs just
>> indicate it took 10ms to process the request once it got there. There's a
>> similar problem with some Java-based application servers that make use of
>> thread pools with a front dispatched using a deep queue.
>>
>> But if you look below, there's no doubt it really took 50 seconds to
>> timeout :
>>
>> > Jul 29 20:21:47 localhost haproxy[17405]: 
>> > 68.174.74.20:55350[29/Jul/2010:20:20:57.478]
>> http_proxy primary_farm/WS01 4/0/1/-1/50006 504 194 - - sH-- 14/14/14/13/0
>> 0/0 "GET /new-york/howitworks HTTP/1.1"
>>
>> Here, all we know is that apache received the request immediately after
>> haproxy,
>> at the same second :
>>
>> > 68.174.74.20 - - [29/Jul/2010:20:20:57 -0400] "GET /new-york/howitworks
>> HTTP/1.1" 200 0 "http://scoutmob.com/new-york/deal"; "Mozilla/5.0
>> (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML, like
>> Gecko) Chrome/5.0.375.99 Safari/533.4"
>>
>> But we don't know when it got the response. Also, have you noticed how the
>> response was empty ? (status 200, 0 bytes). Apache itself has at least
>> been
>> able to log that there was a problem with the communication with the ruby
>> server.
>>
>> Also, below we see that the client probably has hit the reload button,
>> because
>> the referrer is the page itself this time :
>>
>> > 68.174.74.20 - - [29/Jul/2010:20:21:30 -0400] "GET /new-york/howitworks
>> > HTTP/1.1" 200 5698 "http://scoutmob.com/new-york/howitworks";
>> "Mozilla/5.0
>> > (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/533.4 (KHTML,
>> like
>> > Gecko) Chrome/5.0.375.99 Safari/533.4"
>>
>> The best thing to do when you can't trust the tools anymore is to start
>> tcpdump between apache and ruby. Capture using full packets (-s0) and
>> put that in a file (-w /tmp/file.cap). Then you can run ngrep on it to
>> search for the faulty IP. Find the port and look for one very long
>> connection. Probably that you'll notice something odd such as the ruby
>> server returning 200 without any content-length but a transfer-encoding
>> chunked instead, then closing the connection (or dying) after 1 minute,
>> which would result in apache returning what it got (200, 0 bytes), long
>> after haproxy closed.
>>
>> Maybe that's something different, but it's pure guess, and a common
>> pattern.
>>
>> Regards,
>> Willy
>>
>>
>

Reply via email to