Willy, Am 26.08.20 um 16:59 schrieb Willy Tarreau: > On Wed, Aug 26, 2020 at 12:59:55PM +0200, Tim Düsterhus, WoltLab GmbH wrote: >> Within the HAProxy logs we're sometimes seeing largish 'Tr' values for >> static file requests. For example 734ms for the example request below. > > That sounds huge. I could have imagined a 200ms time, possibly indicating > that somewhere between the two, an extra MSG_MORE was present and prevented > a request and/or response from leaving the machine. But 734ms doesn't > correspond to anything usual.
There are even worse ones: nginx: > Aug 26 09:56:50 *snip* abd21141cdb4[5615]: 5F4631E22E407 "GET *snip*.gif > HTTP/2.0" 200 139 0.000 HAProxy: > Aug 26 09:57:17 *snip* haproxy[3066]: *snip*:57086 [26/Aug/2020:09:56:50.265] > https~ *snip*/nginx 0/0/0/27340/27340 200 335 - - ---- 82/82/17/17/0 0/0 > {*snip*|*snip*} {*snip*|} "GET *snip*.gif HTTP/2.0" 5F4631E22E407 Yes, that's 27340ms. I don't have any more details regarding the connection reuse for that, unfortunately :-( FWIW: I disabled HTTP/2 for the backend connection after sending my initial email and I am not seeing anything worse than 45ms for static files any more. >> The two requests in nginx' log: >> >>> Aug 26 10:23:21 *snip* abd21141cdb4[5615]: 5F46381900A0 "GET *snip* >>> HTTP/2.0" 200 27049 0.070 >>> Aug 26 10:23:26 *snip* abd21141cdb4[5615]: 5F46381E00F5 "GET *snip*.gif >>> HTTP/2.0" 200 6341 0.000 >> >> Note how the timing for the PHP request matches up what HAProxy was >> seeing: Roughly 70ms / 0.070s. The timing for the GIF request is *way* >> off, though. > > Stupid question, do the PHP request *always* match ? I'm asking because I just arbitrarily checked 3 PHP requests and they match. The static ones *mostly* do as well. It's just that static files taking ages stick out like a sore thumb, while PHP tends to be all over the place anyway. It basically was a "grep for static file request" and then "halog -pct" and wonder why there's a large number at the bottom. > one point is that haproxy cannot tell the difference between the two. > However the static one very likely responds instantly, which could justify > a less common code path being taken and a bug to be triggered. > > However static files might also cause some disk access, and not knowing > anything about the environment, I'd say it's also possible that the whole > nginx process gets frozen for the time it takes open() to respond, and if > for whatever reason there are lots of I/Os and it takes that long, it might > be possible that it doesn't notice it, depending where the total time is > retrieved. See above: HTTP/1 did not show the issue within the last 4 hours, so I would rule out nginx / the disk being the culprit here. Static files are served from fast local SSDs with plenty of page cache. I don't expect the disk access to bottleneck up to the 3-digit millisecond range. > But it could definitely be a bug in the connection reuse or mux code, or > course. It's just that I have no idea what could cause this for now. > > If you see it often enough, it might be useful to enable h2 traces and > watch them on the CLI, or even run a tcpdump between the two to observe > what's going on. The traces might be easier as a first step as you'll > see already decoded stuff. For example: > > (echo "trace h2 sink buf0; trace h2 level user; trace h2 verbosity \ > complete; trace h2 start now;show events buf0 -w -n" ; cat ) | \ > socat - /path/to/socket > /path/to/logs > > Once you're confident you got it, you can disable all traces using the > short form "trace 0". As I said: It's not really reproducible. I'm mostly seeing it after the fact. I can't even tell if HAProxy's measuring or nginx' measuring is correct (i.e. which latency the client actually experienced). However tcpdump definitely is not going to be fun, because it's all TLS with PFS ciphers. > Another thing you can try is to artificially limit > tune.h2.max-concurrent-streams just in case there is contention in > the server's connection buffers. By default it's 100, you can try with > much less (e.g. 20) and see if it seems to make any difference at all. > The fact that disabling HTTP/2 helps could indicate that something like this is the case here. I'll try that tomorrow, thanks. Best regards Tim Düsterhus Developer WoltLab GmbH -- WoltLab GmbH Nedlitzer Str. 27B 14469 Potsdam Tel.: +49 331 96784338 duester...@woltlab.com www.woltlab.com Managing director: Marcel Werk AG Potsdam HRB 26795 P