Willy,

Am 26.08.20 um 16:59 schrieb Willy Tarreau:
> On Wed, Aug 26, 2020 at 12:59:55PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
>> Within the HAProxy logs we're sometimes seeing largish 'Tr' values for
>> static file requests. For example 734ms for the example request below.
> 
> That sounds huge. I could have imagined a 200ms time, possibly indicating
> that somewhere between the two, an extra MSG_MORE was present and prevented
> a request and/or response from leaving the machine. But 734ms doesn't
> correspond to anything usual.

There are even worse ones:

nginx:

> Aug 26 09:56:50 *snip* abd21141cdb4[5615]: 5F4631E22E407 "GET *snip*.gif 
> HTTP/2.0" 200 139 0.000

HAProxy:

> Aug 26 09:57:17 *snip* haproxy[3066]: *snip*:57086 [26/Aug/2020:09:56:50.265] 
> https~ *snip*/nginx 0/0/0/27340/27340 200 335 - - ---- 82/82/17/17/0 0/0 
> {*snip*|*snip*} {*snip*|} "GET *snip*.gif HTTP/2.0" 5F4631E22E407

Yes, that's 27340ms. I don't have any more details regarding the
connection reuse for that, unfortunately :-(

FWIW: I disabled HTTP/2 for the backend connection after sending my
initial email, and I am not seeing anything worse than 45ms for static
files anymore.
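For the record, on our side that roughly amounted to dropping h2 from
the server-side ALPN, along these lines (backend name, address and ssl
options below are placeholders, not the actual configuration):

  backend nginx
      # before: alpn h2,http/1.1 -- with h2 removed, only HTTP/1.1
      # is negotiated towards nginx
      server nginx1 127.0.0.1:8443 ssl verify none alpn http/1.1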

>> The two requests in nginx' log:
>>
>>> Aug 26 10:23:21 *snip* abd21141cdb4[5615]: 5F46381900A0 "GET *snip* 
>>> HTTP/2.0" 200 27049 0.070
>>> Aug 26 10:23:26 *snip* abd21141cdb4[5615]: 5F46381E00F5 "GET *snip*.gif 
>>> HTTP/2.0" 200 6341 0.000
>>
>> Note how the timing for the PHP request matches up what HAProxy was
>> seeing: Roughly 70ms / 0.070s. The timing for the GIF request is *way*
>> off, though.
> 
> Stupid question, do the PHP requests *always* match? I'm asking because

I just arbitrarily checked 3 PHP requests and they match. The static
ones *mostly* do as well. It's just that static files taking ages stick
out like a sore thumb, while PHP tends to be all over the place anyway.

It basically was a case of grepping for the static file requests,
piping them through "halog -pct", and wondering why there's a large
number at the bottom.
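Roughly along these lines (pattern and log path simplified, not the
exact commands I ran):

  grep '\.gif HTTP/2\.0' /var/log/haproxy.log | halog -pct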

> one point is that haproxy cannot tell the difference between the two.
> However the static one very likely responds instantly, which could justify
> a less common code path being taken and a bug to be triggered.
> 
> However static files might also cause some disk access, and not knowing
> anything about the environment, I'd say it's also possible that the whole
> nginx process gets frozen for the time it takes open() to respond, and if
> for whatever reason there are lots of I/Os and it takes that long, it might
> be possible that it doesn't notice it, depending where the total time is
> retrieved.

See above: HTTP/1 did not show the issue within the last 4 hours, so I
would rule out nginx / the disk being the culprit here.

Static files are served from fast local SSDs with plenty of page
cache, so I don't expect disk access to be a bottleneck anywhere near
the three-digit millisecond range.

> But it could definitely be a bug in the connection reuse or mux code, of
> course. It's just that I have no idea what could cause this for now.
> 
> If you see it often enough, it might be useful to enable h2 traces and
> watch them on the CLI, or even run a tcpdump between the two to observe
> what's going on. The traces might be easier as a first step as you'll
> see already decoded stuff. For example:
> 
>   (echo "trace h2 sink buf0; trace h2 level user; trace h2 verbosity \
>    complete; trace h2 start now;show events buf0 -w -n" ; cat ) | \
>   socat - /path/to/socket > /path/to/logs
> 
> Once you're confident you got it, you can disable all traces using the
> short form "trace 0".

As I said: it's not really reproducible; I'm mostly seeing it after
the fact. I can't even tell whether HAProxy's or nginx's measurement
is correct (i.e. which latency the client actually experienced).

However, tcpdump is definitely not going to be fun, because it's all
TLS with PFS ciphers.
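If it comes to that, I guess a capture between HAProxy and nginx could
still be useful for the timing alone, even without readable payload;
something like (interface, port and output path made up for
illustration):

  tcpdump -i lo -n -s 96 -w /tmp/h2-backend.pcap 'tcp port 8443'

The packet timestamps should at least show which side is sitting on
the data.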

> Another thing you can try is to artificially limit
> tune.h2.max-concurrent-streams just in case there is contention in
> the server's connection buffers. By default it's 100, you can try with
> much less (e.g. 20) and see if it seems to make any difference at all.
> 

The fact that disabling HTTP/2 helps could indicate that something like
this is the case here. I'll try that tomorrow, thanks.
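If I read the docs right, that's a one-liner in the global section,
something like:

  global
      tune.h2.max-concurrent-streams 20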

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P
