Re: HAProxy 1.8.X crashing

Willy Tarreau Thu, 12 Apr 2018 14:35:10 -0700

On Thu, Apr 12, 2018 at 01:14:58PM +0200, Olivier Houchard wrote:
> > > @@ -3718,6 +3719,8 @@ int http_process_request(struct stream *s, struct 
> > > channel *req, int an_bit)
> > >  
> > >                   return 0;
> > >           }
> > > +         /* XXX: We probably need a better mux */
> > > +         conn_install_mux(conn, &mux_pt_ops, objt_cs(s->si[1].end));
> > >  
> > >           path = http_get_path(txn);
> > >           url2sa(req->buf->p + msg->sl.rq.u,
> > 
> > While I can understand how missing this can have caused trouble, I'm not
> > sure it's the best solution, and I'm a bit worried about having various
> > code places choose a mux and install it. I suspect we're "simply" missing
> > a test in the backend code to cover the case where the connection already
> > exists but a mux was not yet installed (a situation that didn't exist
> > prior to muxes which is why we didn't notice this). I think that this
> > happens in connect_server() around line 1177 (return SF_ERR_INTERNAL)
> > when conn_xprt_ready() indicates the transport was not yet initialized.
> > It's likely that the error unrolling fails to consider the case where
> > the mux wasn't initialized, leading to the crash Praveen experienced.
> > 
> 
> I'm not sure about this. My point was to try to install the mux when we can
> choose one, the day we have more relevant muxes, if it ever happens.
> This certainly fixes Praveen's crash.


I totally understand, I just claim that I think we're likely buying a
carpet to put it on the dust instead of removing the dust. It's more of
a conceptual issue : you're forced to select an arbitrary mux at a moment
where you have no idea which one will be needed for this outgoing connection.
In 1.9 it could be H2, later it could be FCGI, etc. And generally when we
have to fill arbitrary information, we pay it the high price later because
it creates unnatural bonds between some unrelated elements.

The single fact that we're forced to pick an arbitrary mux to avoid a crash
somewhere else tells me we have an issue. If one specific code path doesn't
support being called with a muxless connection, this same code path might
also be triggered when the malloc() to create the mux fails and we want to
gracefully abort and try to close this muxless connection after sending a
500 Server Error status.

Also, the only reason why we create the connection this early is that we
need a place where to put the IP addresses, and given that connections
are small (only very slightly larger than the addresses they convey), it
totally makes sense to pre-populate the address there. If we start to add
more complex muxes, we'll even possibly end up allocating more resources
(think h2+hpack contexts), trying to connect and to initiate a handshake
to a target before the rest is finished or whatever funny stuff.

Now don't get me wrong, it would very well be that for now, given the
current state of the code, we're missing many checks and this solution
will be an efficient short-term fix. But for the reasons above it doesn't
sound logical to me. Maybe we're simply missing an "if (conn->mux)" or
"if (cs->mux)" somewhere on the free path before calling something for
the mux because it was assumed that it did obvously exist.

> I don't even know how we could reach line 1177.

I'm uncertain to be honest, I just think that there is a small possibility
if the connection exists and not the transport, but I could be wrong.
> 
> > If this is right, it would mean two fixes :
> >   - one in the error unrolling path to ensure we don't destroy a non
> >     allocated mux ;
> 
> And what do we do then ? Without the mux we can't destroy the cs nor the
> connection.

That may be the problem. But if a mux fails to initialize, then we can't
destroy them either ? Is cs_destroy() the problem here maybe ? If so, I
suspect that this approach would be more robust and would better match
the validity tests on conn->mux that are performed at many places in the
connection code :

        if (cs->conn->mux) {
            cs->conn->mux->detach(cs);
        } else {
            conn_stop_tracking(cs->conn);
            conn_xprt_close(cs->conn);
            conn_free(cs->conn);
        }
        cs_free(cs);

> >   - one above the return SF_ERR_INTERNAL above in connect_server() to
> >     handle the case where the connection already exists but not the mux.
> 
> I'm not sure how this can happen, and what we should do if that happens.

At least in the current situation we pre-create a connection just to start
to fill some information (source/destination address), without knowing the
mux yet. It's an initialization process, it may look beautiful or ugly
depending on the perspective, but in any case we must ensure that the
code overall remains reliable in this situation. And to me it looks less
ugly than picking a possibly wrong mux just to have a valid pointer
avoiding a segfault.

But if it turns out that stabilizing the destroy path (or error path) is
not sufficient and that it breaks somewhere else, we can well adopt this
solution for the short term and see how to address it for the long term;
we already know that the outgoing connection code is severely limited to
the point of currently preventing us from using H2 on the backend, and
that's exactly why we're currently working on it.

Cheers,
Willy

Re: HAProxy 1.8.X crashing

Reply via email to