Re: 1.5 badly dies after a few seconds

Cyril Bonté Tue, 21 Sep 2010 00:17:27 -0700

On Tuesday 21 September 2010 07:38:53, Willy Tarreau wrote:
> Hi Cyril,
> 
> On Tue, Sep 21, 2010 at 01:50:45AM +0200, Cyril Bonté wrote:
> > Hi Willy and Jozsef,
> > 
> > On Monday 20 September 2010 23:42:44, R.Nagy József wrote:
> > > > (...)
> > > > Very nice, now we know that the FD does not get corrupted, but when
> > > > haproxy wants to use it, it's already closed on the other side.
> > > > Probably a TCP rule causes a reject that closes the connection,
> > > > and there is a possible return path that escapes from the
> > > > control, leading to frontend_accept() trying to continue to use
> > > > the closed FD.
> > > > 
> > > > I'll now try to find something in that spirit.
> > > > 
> > > > Also, it looks like only TCP is affected by this, because your
> > > > unix-stream connection worked like a charm, and used the same FD (7).
> > > > 
> > > > I don't see anything suspect in the traces, but they clearly help
> > > > eliminate wrong guesses.
> > > > 
> > > > Thanks Joe, I'll keep you informed if I find anything !
> > 
> > I don't know if it can help you, but tonight I installed FreeBSD in a
> > VM and could easily reproduce the issue.
> > Let me know if you want me to test some patches so that Jozsef doesn't
> > need to break his production traffic.
> > 
> > I'll try to find time in the next few days to add some debug output to
> > track down the origin of the issue.
> > 
> > As a last test (quite a late one, so I must stop here :-)): at the
> > "out_delete_cfd" label, if I replace "return -1" with "return 0", I
> > can no longer reproduce the issue.
> 
> That's interesting, because the "return -1" is here to take the error path,
> and can only be caught once you get a -1 at least once on the setsockopt().
> What I suspect is that sometimes we get a -1 here because the client has
> reset the connection just after it was accepted.


Yes, that's how I reproduce it: I introduced a 2-second delay before the call
to setsockopt() and wrote a small client that resets the connection after 1
second.
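
For illustration, a client of that kind can be sketched as below. This is only
a sketch under assumptions, not the actual test program: it assumes the reset
is forced by enabling SO_LINGER with a zero timeout before close(), and the
names arm_reset_on_close() and connect_then_reset() are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: make close() abort the connection (RST) instead of
 * closing it gracefully (FIN). SO_LINGER enabled with a zero timeout
 * discards unsent data and resets the connection on close(). */
int arm_reset_on_close(int fd)
{
    struct linger lg;

    lg.l_onoff = 1;   /* enable lingering */
    lg.l_linger = 0;  /* zero timeout: abortive close */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Hypothetical helper: connect to ip:port, wait delay_sec seconds, then
 * reset the connection. Returns 0 on success, -1 on any failure. */
int connect_then_reset(const char *ip, uint16_t port, unsigned int delay_sec)
{
    struct sockaddr_in addr;
    int fd;

    assert(ip != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;  /* not a valid dotted-quad IPv4 address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        arm_reset_on_close(fd) < 0) {
        close(fd);
        return -1;
    }

    sleep(delay_sec);  /* 1 second here, shorter than the 2-second delay */
    return close(fd);  /* sends the RST because of SO_LINGER {1, 0} */
}
```

Calling connect_then_reset() with the frontend's address, its port and a delay
of 1 should make the reset arrive while the proxy is still sleeping before its
setsockopt() call.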

> We then take the error
> path and we have something there which incorrectly unrolls all that was
> done. I've looked again and can't find what (fd_delete() is done, then all
> the free and close). Well, fd_delete() already does a close(), so maybe
> we're having an issue on freebsd with two consecutive close() on the same
> fd that we don't have on another OS. I think we could move the fd_delete()
> to session.c instead of frontend.c, since it's the one that does the
> fd_insert().

I don't think it comes from the close(), as a non-fatal error (return 0) also
closes the descriptor in the loop.
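
For reference, the effect of two consecutive close() calls on the same
descriptor can be checked in isolation with something like the sketch below.
It is not taken from the haproxy sources, and second_close_errno() is a
hypothetical name. In a single-threaded program where nothing is opened
between the two calls, the second close() is expected to fail with EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: close the same descriptor twice and report what the
 * second close() did. Returns the errno of the second close(), 0 if it
 * unexpectedly succeeded, or -1 if the setup itself failed. */
int second_close_errno(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    assert(fds[0] >= 0 && fds[1] >= 0);

    close(fds[1]);          /* the write end is not needed */

    if (close(fds[0]) < 0)  /* first close: expected to succeed */
        return -1;

    errno = 0;
    if (close(fds[0]) == 0) /* second close on the same number */
        return 0;
    return errno;           /* EBADF: the descriptor is no longer open */
}
```

The more dangerous variant is when another descriptor is opened between the
two calls and receives the same number: the second close() then succeeds and
silently closes an unrelated descriptor.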

I've noticed this part of the code in stream_sock.c:
                        /* critical error encountered, generally a resource shortage */
                        if (p) {
                                EV_FD_CLR(fd, DIR_RD);
                                p->state = PR_STIDLE;
                        }
I think this EV_FD_CLR call is responsible for the issue. With it commented
out, I can no longer reproduce the issue either.

> But anyway, I think that in your tests with your change, you should see the
> message at least once, with the difference that it is not fatal.

Yes, the connection is still reset by peer.

--
Cyril Bonté
