Hi Willy,

> As you like. My first rule is never to make people take risks they're not
> willing to take. It's perfectly OK to me if you don't feel confident with
> 2.0-dev in prod. I'm going to perform the 1.9 backports. If you're
> interested in testing them from the branch before I release it today, just
> let me know.

I just pulled, compiled, and tested the newly minted 1.9.3, and I'm 
experiencing the same issue with alpn h2 on the backend definition. I also 
strongly suspect it's not related to maximum streams per connection, because 
the issue happens well before 1000 requests (and consistently at that).
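For reference, the relevant backend looks roughly like this (backend name, server name, and address are placeholders, of course):

backend bk_app
    # TLS to the origin with h2 negotiated via ALPN; this is the combination
    # that triggers the problem for me
    server app1 10.0.0.10:443 ssl verify none alpn h2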

> Perhaps the client causing the issues was a red herring for
> the server-side bugs.

I believe the fixes in 1.9.3 have actually proven this false; the client-side 
cancellations really do seem to be the trigger. I can replicate this bug every 
single time with the following steps:

1. Make a set of requests
2. Cancel all or a subset of those requests
3. Make another set of requests

On step 3, every single request fails: something is left in a bad state by step 
2, causing the server-side stream to go away. The log lines show the same 
pattern of termination flags: CC-- or CD--, and SD--.
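In case a scriptable reproducer is useful, here's a rough sketch of the idea in 
Go; it's not exactly what Chrome does, and the URL, request counts, and timings 
are placeholders, but cancelling a request context mid-flight makes Go's h2 
transport send RST_STREAM on the shared connection, which should approximate 
the three steps above:

package main

import (
    "context"
    "crypto/tls"
    "fmt"
    "net/http"
    "sync"
    "time"
)

// Placeholder for the haproxy frontend under test.
const target = "https://haproxy.example.com/some/asset"

// burst fires n GET requests concurrently on the shared client. If cancelAfter
// is non-zero, each request's context is cancelled after that delay, which
// makes Go's h2 transport send RST_STREAM for the stream (roughly what Chrome
// does when it aborts in-flight requests). The delay is a guess; tune it so
// responses are still in flight when the cancel hits.
func burst(client *http.Client, n int, cancelAfter time.Duration) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            ctx := context.Background()
            if cancelAfter > 0 {
                var cancel context.CancelFunc
                ctx, cancel = context.WithTimeout(ctx, cancelAfter)
                defer cancel()
            }
            req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
            if err != nil {
                fmt.Printf("request %d: %v\n", i, err)
                return
            }
            resp, err := client.Do(req)
            if err != nil {
                fmt.Printf("request %d: %v\n", i, err)
                return
            }
            resp.Body.Close()
            fmt.Printf("request %d: %s via %s\n", i, resp.Status, resp.Proto)
        }(i)
    }
    wg.Wait()
}

func main() {
    // One shared client so all requests multiplex over the same negotiated
    // h2 connection, as a browser would.
    client := &http.Client{
        Transport: &http.Transport{
            TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // test setup only
            ForceAttemptHTTP2: true,
        },
    }

    burst(client, 10, 0)                   // step 1: make a set of requests
    burst(client, 10, 50*time.Millisecond) // step 2: cancel requests mid-flight
    time.Sleep(time.Second)
    burst(client, 10, 0)                   // step 3: these are the ones that fail
}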

Another piece of information: when this happens, Chrome drops the following in 
the console, and it always correlates with an SD-- line in the haproxy logs:

Failed to load resource: net::ERR_SPDY_PROTOCOL_ERROR

I also just verified this happens under similar circumstances using alpn 
http/1.1 on the backend (this may or may not be new in 1.9.3); the only 
configuration change was the ALPN value on the server line (shown after the 
error list below). Four requests failed on the client side with the following 
error messages after the same three-step process (all correlate with CD-- 
lines in the logs):

net::ERR_SPDY_PROTOCOL_ERROR
net::ERR_CONNECTION_CLOSED 200
net::ERR_CONNECTION_CLOSED 200
net::ERR_CONNECTION_CLOSED 200
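(For reference, the only change for that test was the ALPN value on the backend 
server line, roughly:

    server app1 10.0.0.10:443 ssl verify none alpn http/1.1

with the same placeholder names as the excerpt above.)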

I wonder if HAProxy is interpreting a broken (cancelled) request as a client 
error and dropping the connection without actually sending a GOAWAY frame? I 
don't know enough about h2 to say whether that's within the spec, but perhaps 
that's another avenue of investigation?

I'm more than happy to help, and while my C is a bit rusty, I'm starting to get 
a feel for the HAProxy source, so I could attempt to debug as well, if you have 
any suggestions in that vein.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, January 25, 2019 9:48 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Luke,
>
> On Fri, Jan 25, 2019 at 08:08:22AM +0000, Luke Seelenbinder wrote:
>
> > Hi Willy,
> >
> > > OK so instead of sending you a boring series, I can propose you to run
> > > a test on 2.0-dev, which contains all the fixes I had to go through
> > > because of tiny issues everywhere related to this. If you're using git,
> > > just clone the master and checkout commit f7a259d46f8.
> > > you can simply wait for the next nightly snapshot.
> >
> > Sounds good. My compilation playbook uses tarballs, so I'll just use the 
> > last
> > nightly. I assume I should wait for these fixes to be backported (1.9.3?)
> > before trying anything in production?
>
> As you like. My first rule is never to make people take risks they're not
> willing to take. It's perfectly OK to me if you don't feel confident with
> 2.0-dev in prod. I'm going to perform the 1.9 backports. If you're
> interested in testing them from the branch before I release it today, just
> let me know.
>
> > > But now you have a new server parameter called
> > > "max-reuse". This allows to limit the number of times a server connection
> > > is reused. For example you can set it to 990 when you know that the
> > > server limits to 1000.
> >
> > That's great! I didn't expect to get a new configuration option. I'll
> > definitely make sure these are in sync across our infrastructure.
>
> Even without the option it will work better than before, but the option
> is there to completely void any risk of hitting the limit too late.
>
> > > Regarding the fact that in your case the client's close seems to cause
> > > the server-side issue, I couldn't yet reproduce it though I have a few
> > > theories about it. One of them would be an unexpected response from
> > > the server causing the connection to turn to an error state. The other
> > > one would be that we'd incorrectly abort our stream and/or session and
> > > bring the connection down with us. I'll submit these theories to Olivier
> > > once he's back so that he can tell me I'm saying crap regarding some of
> > > them and we can focus on what remains :-)
> >
> > Sounds good. I'll report back my results from the latest snapshot and we can
> > go from there. Perhaps the client causing the issues was a red herring for
> > the server-side bugs.
>
> I hadn't thought about it but it could also be, indeed.
>
> > Thanks again for deep-diving and resolving this! I won't ask how many hours
> > it took to find all these small edge cases. . .
>
> Usually you start from a bug report, you find a hook in the code which
> starts to explain it, and you walk along the thread discovering that a
> lot of places are wrong together and once perfectly aligned cause crazy
> things to happen. Of course there's the solution of putting some brown
> paper bag on top of the most visible one, but in this project we prefer
> to address the causes than the consequences ;-) So yes it sometimes
> takes time and caffeine, and often delays releases because it's always
> hard to accept to release something with known unfixed issues in it.
>
> Cheers,
> Willy
