Hi Willy,

Unfortunately, I spoke too soon in my last email. After hitting send, I went down the rabbit hole again and uncovered some behaviors I thought we'd rooted out. Namely, any time I use HTX mode with an H2 frontend -> H1 or H2 backend and have frequent request cancellation as discussed previously, I'm seeing hung requests. It's not every request, nor is it every cycle of requests, but I'd say at least 10% of requests end up hanging indefinitely until they eventually time out according to HAProxy. (So perhaps this is an indicator itself of what might be wrong?) HAProxy reports retries / redispatches and maxes out the timeouts, then the request dies. Here are two example log lines; on the second one I killed the request myself:
[18/Mar/2019:15:02:49.723] stadiamaps~ tile/tile1 0/37204/-1/-1/49606 503 0 - - sC-- 2/1/2/2/3 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"
[18/Mar/2019:15:03:39.507] stadiamaps~ tile/tile1 0/24804/-1/-1/29123 503 0 - - CC-- 2/1/0/0/2 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"

If I disable HTX, everything flows as normal, and the status codes are even correctly -1. I've replicated this on the 1.9.4, 1.9.x master, and 2.0-dev master branches. The global "this will work" / "this will not work" switch is HTX mode. Any time it's enabled, I see the bad behavior; any time it's disabled, I see flawless behavior.

Any thoughts? I've tried this with and without http-reuse, abortonclose, and various settings for pool-purge-delay.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 18, 2019, at 13:46, Luke Seelenbinder wrote:
> Hi Willy,
>
> I finally had the opportunity to try out `option abortonclose`.
>
> Initially, it made the problem much worse. Instead of occasionally
> incorrect status codes in the logs, I saw requests fail in the
> following manner:
>
> [18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png HTTP/2.0"
> [18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png HTTP/2.0"
>
> What's further interesting, it was consistently 2 out of 18
> requests. That led me down the road of checking queue timeouts
> (noticing the timing correlation in the logs). I adjusted `timeout
> connect` up from 6200ms to 12400ms and set pool-purge-delay to 60s.
>
> After adjusting those timeouts and pool purges and re-enabling
> `abortonclose`, the request errors I was seeing magically went away.
> I'll push this config to production and see if we see a reduction in
> 503s.
> I also suspect we'll see a marginal improvement in throughput and
> response time due to keeping backend connections open longer.
>
> I'll also keep an eye out for inconsistencies between our backend
> accept capability and timeouts and see if perhaps we're overrunning
> some buffer somewhere in HAProxy, NGINX, or somewhere else.
>
> Thanks for your help so far!
>
> Best,
> Luke
>
> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
>
> On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> > On Mon, Mar 04, 2019 at 11:45:53AM +0000, Luke Seelenbinder wrote:
> > > Hi Willy,
> > >
> > > > Do you have "option abortonclose" in your config ?
> > >
> > > We do not have abortonclose. Do you recommend this if we have a lot of
> > > client-side request aborts (but not connection-level closes)? From reading
> > > the docs, I came away conflicted as to the implications. :-)
> >
> > It will help, especially if you have maxconn configured on your server
> > lines, as it will allow the requests to be aborted while still in queue.
> >
> > That said, we still don't know exactly what causes your logs.
> >
> > Willy
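
For reference, the configuration knobs discussed in this thread would look roughly like the sketch below. This is a minimal illustration, not the actual production config: the backend/server names, address, maxconn, and http-reuse policy are assumptions for the example, while `option http-use-htx` (the HTX toggle on 1.9/2.0-dev), `option abortonclose`, `timeout connect 12400ms`, and `pool-purge-delay 60s` are the settings mentioned above.

```
# Sketch only -- names, address, and maxconn are illustrative.
defaults
    mode http
    option http-use-htx          # the global switch: disabling this makes the hangs disappear
    option abortonclose          # allows queued requests to be aborted on client close
    timeout connect 12400ms      # raised from 6200ms while chasing queue timeouts
    timeout client  30s
    timeout server  30s

backend tile
    http-reuse safe              # tried with and without; no effect on the hangs
    # pool-purge-delay controls how long idle backend connections are kept
    server tile1 10.0.0.1:443 ssl verify none alpn h2,http/1.1 maxconn 32 pool-purge-delay 60s
```

With this in place, toggling `option http-use-htx` alone reproduces the "works / doesn't work" behavior described above.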