On Wed, Aug 26, 2020 at 05:38:52PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> There are even worse ones:
> 
> nginx:
> 
> > Aug 26 09:56:50 *snip* abd21141cdb4[5615]: 5F4631E22E407 "GET *snip*.gif 
> > HTTP/2.0" 200 139 0.000
> 
> HAProxy:
> 
> > Aug 26 09:57:17 *snip* haproxy[3066]: *snip*:57086 
> > [26/Aug/2020:09:56:50.265] https~ *snip*/nginx 0/0/0/27340/27340 200 335 - 
> > - ---- 82/82/17/17/0 0/0 {*snip*|*snip*} {*snip*|} "GET *snip*.gif 
> > HTTP/2.0" 5F4631E22E407
> 
> Yes, that's 27340ms. I don't have any more details regarding the
> connection reuse for that, unfortunately :-(

Ouch!

> FWIW: I disabled HTTP/2 for the backend connection after sending my
> initial email and I am not seeing anything worse than 45ms for static
> files any more.

OK, that already narrows the issue down a bit.

> I just arbitrarily checked 3 PHP requests and they match. The static
> ones *mostly* do as well. It's just that static files taking ages stick
> out like a sore thumb, while PHP tends to be all over the place anyway.
> It basically was a "grep for static file request" and then "halog -pct"
> and wonder why there's a large number at the bottom.

I can understand that, indeed.
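
For the record, that roughly boils down to something like this (assuming
the log path and a .gif request as the static pattern, adjust to your
setup):

    # keep only the static file requests, then let halog print the
    # per-timer percentiles; outliers stand out at the high percentiles
    grep '\.gif HTTP' /var/log/haproxy.log | halog -pct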

> See above: HTTP/1 did not show the issue within the last 4 hours, so I
> would rule out nginx / the disk being the culprit here.

That's perfect; it definitely rules out that part.

> Static files are served from fast local SSDs with plenty of page cache.
> I don't expect the disk access to bottleneck up to the 3-digit
> millisecond range.

OK. I've been hit in the past by machines swapping out while uploading
some logs over HTTP, then taking ages to even allocate network buffers
after that! They used to slowly recover during the day and start their
madness again the next night. It was even causing packet losses and
retransmissions! That's why I asked :-)

> As I said: It's not really reproducible.

Yeah, but we all have our own definition of "not really reproducible".
Since you consider that it doesn't happen with HTTP/1 after only 4 hours,
to me this means you expect it to usually happen at least once every 4
hours. If the load is not too high (no more than a few thousand requests
per second) and you have some disk space, the h2 trace method could prove
useful once you have isolated a culprit in the logs. You could even gzip
the output on the fly to save space; the traces compress very well.
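
If you want to give it a try, roughly something like this should do it
(untested here, and assuming a stats socket at /var/run/haproxy.sock and
the default in-memory ring "buf0"; adjust names and paths to your setup):

    # enable h2 traces and send them to the ring buffer "buf0"
    echo "trace h2 sink buf0; trace h2 level developer; trace h2 start now" | \
        socat stdio /var/run/haproxy.sock

    # stream the ring's contents and compress them on the fly
    # (-t keeps socat around long enough to keep reading from the socket)
    echo "show events buf0 -w" | \
        socat -t 86400 stdio /var/run/haproxy.sock | gzip -c > h2-trace.log.gz

Then "trace h2 stop now" once you've caught an occurrence.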

> I'm mostly seeing it after the
> fact. I can't even tell if HAProxy's measuring or nginx' measuring is
> correct (i.e. which latency the client actually experienced).

Unless there's a bug causing haproxy to report a random value (or a timer
related to the previous request, for example, which would explain the
large values), I'd be tempted to say that it's likely the client
experiences it :-/

> However tcpdump definitely is not going to be fun, because it's all TLS
> with PFS ciphers.

That's what I was suspecting, sadly, but I preferred to ask.

> > Another thing you can try is to artificially limit
> > tune.h2.max-concurrent-streams just in case there is contention in
> > the server's connection buffers. By default it's 100, you can try with
> > much less (e.g. 20) and see if it seems to make any difference at all.
> > 
> 
> The fact that disabling HTTP/2 helps could indicate that something like
> this is the case here. I'll try that tomorrow, thanks.

You're welcome!
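
Just for completeness, that's a one-line change in the global section,
something along these lines (20 being an arbitrary low value to test with,
as mentioned above), followed by a reload:

    global
        # limit the number of concurrent streams per h2 connection
        # (the default is 100)
        tune.h2.max-concurrent-streams 20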

Willy
