Hi Willy,
Unfortunately, I spoke too soon in my last email. After hitting send, I went
down the rabbit hole again and uncovered some behaviors I thought we'd rooted
out. Namely, any time I use HTX mode with an H2 frontend -> H1 or H2 backend
and have frequent request cancellation as discussed previously, I'm seeing hung
requests. It's not every request, nor every cycle of requests, but I'd say at
least 10% of requests hang indefinitely until they eventually time out
according to HAProxy. (Perhaps that is itself an indicator of what might be
wrong?) HAProxy reports retries / redispatches, maxes out the timeouts, and
then the request dies. Here are two example log lines; in the second one I
killed the request myself:
[18/Mar/2019:15:02:49.723] stadiamaps~ tile/tile1 0/37204/-1/-1/49606 503 0 - -
sC-- 2/1/2/2/3 0/0 {} "GET /tiles/osm_bright/10/565/[email protected] HTTP/2.0"
[18/Mar/2019:15:03:39.507] stadiamaps~ tile/tile1 0/24804/-1/-1/29123 503 0 - -
CC-- 2/1/0/0/2 0/0 {} "GET /tiles/osm_bright/10/565/[email protected] HTTP/2.0"
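For anyone following along, the timer field in those lines can be decoded per
the 1.9 log-format docs (Tq/Tw/Tc/Tr/Tt); here's a minimal sketch I've been
using to eyeball them (field positions assumed from the default HTTP log
format, nothing HAProxy-specific beyond that):

```python
# Hedged sketch: decode the timer field from HAProxy's default HTTP log
# format in 1.9 (Tq/Tw/Tc/Tr/Tt). A value of -1 means that phase never
# completed before the session ended.
def decode_timers(field: str) -> dict:
    names = ["Tq", "Tw", "Tc", "Tr", "Tt"]
    values = [int(v) for v in field.split("/")]
    return dict(zip(names, values))

timers = decode_timers("0/37204/-1/-1/49606")
# Tc == -1 and Tr == -1: the connection to the server was never
# established and no response was received, which lines up with the
# 'sC' termination state (server-side timeout while waiting to connect).
```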
If I disable HTX, everything flows as normal, and the status codes are even
correctly -1.
I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The
global "this will work" and "this will not work" switch is HTX mode. Anytime
it's enabled, I see bad behavior. Anytime it's disabled, I see flawless
behavior.
Any thoughts? I've tried this with and without http-reuse and abortonclose,
and with various settings for pool-purge-delay.
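For reference, the knobs I've been toggling look roughly like this (an
illustrative fragment only, with placeholder addresses and my current values;
not the exact production config):

```
global
    pool-purge-delay 60s        # keep idle backend connections around longer

defaults
    option http-use-htx         # enabling this triggers the hangs; disabling avoids them
    option abortonclose         # propagate client aborts to queued requests
    timeout connect 12400ms     # raised from 6200ms as described below

backend tile
    http-reuse safe
    server tile1 127.0.0.1:8080 maxconn 32
```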
Best,
Luke
—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com
On Mon, Mar 18, 2019, at 13:46, Luke Seelenbinder wrote:
> Hi Willy,
>
> I finally had the opportunity to try out `option abortonclose`.
>
> Initially, it made the problem much worse. Instead of occasionally
> incorrect status codes in the logs, I saw requests fail in the
> following manner:
>
> [18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804
> 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png
> HTTP/2.0"
> [18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803
> 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png
> HTTP/2.0"
>
> What's more interesting, it was consistently 2 out of 18
> requests. That led me down the road of checking queue timeouts
> (noticing the timing correlation in the logs). I adjusted `timeout
> connect` up from 6200ms to 12400ms and added pool-purge-delay to 60s.
>
> After adjusting those timeouts and pool purges and re-enabling
> `abortonclose`, the request errors I was seeing magically went away.
> I'll push this config to production and see if we see a reduction in
> 503s. I also suspect we'll see a marginal improvement in throughput and
> response time due to keeping backend connections open longer.
>
> I'll also keep an eye out for inconsistencies between our backend
> accept capability and timeouts and see if perhaps we're overrunning
> some buffer somewhere in HAProxy, NGINX, or somewhere else.
>
> Thanks for your help so far!
>
> Best,
> Luke
>
> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
>
> On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> > On Mon, Mar 04, 2019 at 11:45:53AM +0000, Luke Seelenbinder wrote:
> > > Hi Willy,
> > >
> > > > Do you have "option abortonclose" in your config ?
> > >
> > > We do not have abortonclose. Do you recommend this if we have a lot of
> > > client-side request aborts (but not connection level closes)? From reading
> > > the docs, I came away conflicted as to the implications. :-)
> >
> > It will help, especially if you have maxconn configured on your server
> > lines, as it will allow the requests to be aborted while still in queue.
> >
> > That said, we still don't know exactly what causes your logs.
> >
> > Willy
> >
>
>