Hi Willy,

I've done a good bit more digging. Here's my notes so far.

1. On occasion the old behavior (CD-- then SD--) still exhibits. It's a bit too 
consistent for me to push to our production platform yet.
2. It seems Chrome and Firefox's behavior is different enough to change the 
frequency of this being a problem. I consistently experience errors on Chrome 
(resulting in net::ERR_SPDY_PROTOCOL_ERROR or CONNECTION_ERROR), but I only 
occasionally have similar issues on Firefox. Safari is a non-issue, because it 
doesn't cancel in-flight requests at all.
3. Going directly against NGINX exhibits fault-tolerant behavior in all 
browsers. I see a smattering of 499 (client went away) and 401 (unauthorized, 
but I'm not sure why?) errors, but never valid requests erroring. It also 
reuses the same connection ID across all requests.
4. The patch definitely improved the situation. I hit an error every 4-5 double 
zoom cycles, and that's usually only 1-2 requests of 10.
5. When an error occurs, Chrome definitely renegotiates the connection, as one 
would expect.
6. Firefox seems less problematic than Chrome. I get an error maybe every 10 
zoom cycles, but that usually affects the whole connection. (Undoubtedly due to 
differences in how Firefox and Chrome handle errors.)
7. The 403 errors previously experienced are definitely do to our 
authentication / authorization ACLs, the requests are malformed or 
misrepresented well enough to not pass the ACLs. Maybe I should return a 400 
instead of 403 in my previously mentioned ACL? Removing all of our ACLs removes 
all 403s and PR-- and turned the response code into -1.
8. Not all CD-- / SD-- pairs result in client-side failed requests. (This is a 
definite improvement as well.)
9. Most CD-- / SD-- lines are independent now. That is, some requests from a 
set fail with CD-- and some with SD--, but CD-- responses don't seem to cause 
SD-- anymore.

I don't know if any of this rings any bells, but that's what I've been able to 
determine so far. Overall the patch improves things significantly, but some 
latent issues are still around.

I would be happy to continue to assist however I can. :)

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, January 30, 2019 10:32 AM, Luke Seelenbinder 
<[email protected]> wrote:

> Hi Willy,
> 

> > 403 is strange, it's forbidden. 400 I could understand. What might happen
> > is the following scenario :
> 

> > I'm really not thinking of any case resulting in this but I'll take a
> > second look.
> 

> This…is actually a configuration issue on my side, I think. Sorry for the 
> wild-goose chase. I have the following lines:
> 

> acl stadia_request hdr_dom(host) -i stadiamaps.com
> http-request deny unless stadia_request
> 

> I can well imagine this causing 403s instead of 400s on canceled requests.
> 

> > So Chrome likely sends RST_STREAM immediately while Firefox silently
> > cancels the stream and responds with RST_STREAM when it gets the response
> > back.
> 

> Even more fun: Safari doesn't actually cancel the request. It just happily 
> processes whatever it gets back. So the three major browsers process it all 
> in three different ways. Fun.
> 

> > Don't get me wrong, I'm extremely careful about killing streams only
> > as much as possible. There's even an h2spec failure due to this, where
> > it expects a connection error, but where doing so would result in too
> > many streams being caught causing this to happen. So as long as we can
> > identify the exact sequence which causes this, I'm obviously interested
> > in studying it!
> 

> Okay—great! I will keep digging to see if I can consistently replicate what 
> I'm seeing. I suspect this may be a side-effect of the http-request deny acl 
> above.
> 

> I will remove that acl and test without it to see if that improves the 
> situation, if so, I'll come up with a better way to implement this check.
> 

> Best,
> Luke
> 

> —
> Luke Seelenbinder
> Stadia Maps | Founder
> stadiamaps.com
> 

> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, January 30, 2019 10:21 AM, Willy Tarreau [email protected] wrote:
> 

> > Hi Luke,
> 

> > On Wed, Jan 30, 2019 at 08:41:03AM +0000, Luke Seelenbinder wrote:
> 

> > > It works! I'm seeing very, very few CD-- -> SD-- chains now. I did see a 
> > > few
> > > in h2<->h2 mode, but precious few, so I'm very happy to say the bug as
> > > previous manifested is remedied! Thanks for digging so wide and deep for 
> > > this
> > > bug fix--kinda funny how it ended up being half an if condition on one 
> > > line.
> > > . . software development is the best, ain't it? :-)
> 

> > Excellent, thanks for the report.
> 

> > > However, that leads to a second tier of less-important questions:
> 

> > > 1.  Should RST_STREAM (I believe that's what happens on request 
> > > cancellation)
> > >     also reset the full h2 connection?
> > >     

> 

> > 

> 

> > It should never happen, unless of course the frame is invalid protocol-wise
> > and results in a protocol error, but in practice it should not happen when
> > coming from a browser. Thus I suspect we have another case of GOAWAY.
> 

> > > I'm seeing the following behavior:
> 

> > > -   Ensure connection established. Zoom twice. Set X is canceled. Set Y
> > >     succeeds, but the first request from Y must renegotiate the 
> > > connection & TLS,
> > >     which definitely results in a performance hit on our side.
> > >     

> 

> > > 

> 

> > > 2.  Sometimes a canceled request results in a 403 because the request is
> > >     malformed (PR-- log line); this makes sense in the usual case, but in 
> > > this
> > >     specific case, I'm not so sure.
> > >     

> 

> > 

> 

> > 403 is strange, it's forbidden. 400 I could understand. What might happen
> > is the following scenario :
> 

> > -   the client sends the request
> 

> > -   the H2 demux decodes it, creates an H2 stream, then creates an
> >     application-layer stream, attaches them and wakes the latter up
> >     

> 

> > -   the client sends RST_STREAM
> 

> > -   the demux places an error flag on the stream (it was aborted)
> 

> > -   the stream is scheduled, reads the decoded request to process it,
> >     and finds a request plus an error
> >     -> 400 bad request could be logged to indicate how this stream was
> >     aborted. It's not much different from what could happen in HTTP/1
> >     if the client had aborted the connection just before it was started
> >     to be parsed.
> >     

> 

> > 

> 

> > > This would also result in a broken connection
> > > on the client side, I'm thinking.
> 

> > It should not, but as you probably know, everything lies in the difference
> > between "should" and "can".
> 

> > > Would there be a way to handle RST_STREAM
> > > different without compromising security reasons for rejecting malformed /
> > > broken requests?
> 

> > I'm really not thinking of any case resulting in this but I'll take a
> > second look.
> 

> > > Side note: this seems to be a behavior difference between Firefox and 
> > > Chrome.
> > > In Chrome I always get PR-- lines. In Firefox I almost always get C(C|D)--
> > > lines.
> 

> > So Chrome likely sends RST_STREAM immediately while Firefox silently
> > cancels the stream and responds with RST_STREAM when it gets the response
> > back.
> 

> > > Would it be possible to remove the tie between stream failure and 
> > > connection
> > > failures on the client side? That is, would it be possible to allow
> > > individual stream failures, such is the case with client-side request
> > > cancellation, while maintaining the integrity of the whole h2 client
> > > connection? I know this is delving pretty deep into the nuances of h2 vs
> > > http/1.1 and resulting issues and may not even be a server-side solution, 
> > > but
> > > I figured it's worth asking while the code is already in your head. :-)
> 

> > Don't get me wrong, I'm extremely careful about killing streams only
> > as much as possible. There's even an h2spec failure due to this, where
> > it expects a connection error, but where doing so would result in too
> > many streams being caught causing this to happen. So as long as we can
> > identify the exact sequence which causes this, I'm obviously interested
> > in studying it!
> 

> > > Thanks again for this fix! I will roll it out to our servers today after a
> > > bit more testing, since the incidence of bad results is negligible and 
> > > this
> > > definitely should improve our e2e timings.
> 

> > Great, looking forward to hearing about your new findings!
> 

> > Thanks,
> > Willy

Attachment: publickey - [email protected] - 0xB23C1E8A.asc
Description: application/pgp-keys

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to