Hi Willy,

Unfortunately, I spoke too soon in my last email. After hitting send, I went 
down the rabbit hole again and uncovered some behaviors I thought we'd rooted 
out. Namely, any time I use HTX mode with an H2 fe -> H1 or H2 backend and have 
frequent request cancellation as discussed previously, I'm seeing hung 
requests. It's not every request nor is it every cycle of requests, but I'd say 
at least 10% of requests end up hanging indefinitely until they eventually 
timeout according to HAProxy. (So perhaps this is an indicator itself of what 
might be wrong?) HAProxy reports retries / redispatches and maxes out the 
timeouts then the request dies. Here's two example log lines, the second one I 
killed the request myself:

[18/Mar/2019:15:02:49.723] stadiamaps~ tile/tile1 0/37204/-1/-1/49606 503 0 - - 
sC-- 2/1/2/2/3 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"
[18/Mar/2019:15:03:39.507] stadiamaps~ tile/tile1 0/24804/-1/-1/29123 503 0 - - 
CC-- 2/1/0/0/2 0/0 {} "GET /tiles/osm_bright/10/565/3...@2x.png HTTP/2.0"

If I disable HTX, everything flows per normal and the status codes are even 
correctly -1.

I've replicated this on 1.9.4, 1.9.x master, and 2.0-dev master branches. The 
global "this will work" and "this will not work" switch is HTX mode. Anytime 
it's enabled, I see bad behavior. Anytime it's disabled, I see flawless 
behavior.

Any thoughts? I've tried this with and without http-reuse, abortonclose, 
various settings for pool-purge-delay.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

On Mon, Mar 18, 2019, at 13:46, Luke Seelenbinder wrote:
> Hi Willy,
> 
> I finally had the opportunity to try out `option abortonclose`.
> 
> Initially, it made the problem much worse. Instead of occasionally 
> incorrect status codes in the logs, I saw requests fail in the 
> following manner:
> 
> [18/Mar/2019:12:30:08.040] stadiamaps~ tile/tile1 0/18603/-1/-1/24804 
> 503 0 - - sC-- 2/1/1/1/3 0/0 {} "GET /tiles/osm_bright/6/31/20.png 
> HTTP/2.0"
> [18/Mar/2019:12:30:08.041] stadiamaps~ tile/tile1 0/18602/-1/-1/24803 
> 503 0 - - sC-- 2/1/0/0/3 0/0 {} "GET /tiles/osm_bright/6/34/20.png 
> HTTP/2.0"
> 
> What's further interesting, it is was consistently 2 out of 18 
> requests. That led me down the road of checking queue timeouts 
> (noticing the timing correlation in the logs). I adjusted `timeout 
> connect` up from 6200ms to 12400ms and added pool-purge-delay to 60s.
> 
> After adjusting those timeouts and pool purges and re-enabling 
> `abortonclose`, the request errors I was seeing magically went away. 
> I'll push this config to production and see if we see a reduction in 
> 503s. I also suspect we'll see a marginal improvement in throughput and 
> response time due to keeping backend connections open longer.
> 
> I'll also keep an eye our for inconsistencies between our backend 
> accept capability and timeouts and see if perhaps we're overrunning 
> some buffer somewhere in HAProxy, NGINX, or somewhere else.
> 
> Thanks for your help so far!
> 
> Best,
> Luke
> 
> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
> 
> On Mon, Mar 4, 2019, at 14:08, Willy Tarreau wrote:
> > On Mon, Mar 04, 2019 at 11:45:53AM +0000, Luke Seelenbinder wrote:
> > > Hi Willy,
> > >
> > > > Do you have "option abortonclose" in your config ?
> > >
> > > We do not have abortonclose. Do you recommend this if we have a lot of
> > > client-side request aborts (but not connection level closes)? From reading
> > > the docs, I came away conflicted as to the implications. :-)
> > 
> > It will help, especially if you have maxconn configured on your server
> > lines, as it will allow the requests to be aborted while still in queue.
> > 
> > That said, we still don't know exactly what causes your logs.
> > 
> > Willy
> >
> 
>

Reply via email to