I don't have any IB nodes, but I'm interested in seeing how this happens. What I would like to understand here is how we get back into the OpenIB code if add_procs failed for the BTL ...

  george.

On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote:

> On Tue, 1 Jun 2010, Jeff Squyres wrote:
>
>> On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
>>
>>> In my case, the error happens in :
>>> mca_btl_openib_add_procs()
>>> mca_btl_openib_size_queues()
>>> adjust_cq()
>>> ibv_create_cq_compat()
>>> ibv_create_cq()
>>
>> Can you nail this down any further? If I modify adjust_cq() to always
>> return OMPI_ERROR, I see the openib BTL fail over properly to the TCP BTL.
> It must be because create_cq actually creates cqs. Try to apply this patch
> which makes create_cq_compat() *not* creates the cqs and return an error
> instead :
> ========================================================================
> diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
> --- a/ompi/mca/btl/openib/btl_openib.c Fri May 28 14:50:25 2010 +0200
> +++ b/ompi/mca/btl/openib/btl_openib.c Wed Jun 02 10:56:57 2010 +0200
> @@ -146,6 +146,7 @@
> int cqe, void *cq_context, struct ibv_comp_channel *channel,
> int comp_vector)
> {
> + return OMPI_ERROR;
> #if OMPI_IBV_CREATE_CQ_ARGS == 3
> return ibv_create_cq(context, cqe, channel);
> #else
> ========================================================================
>
> You should see MPI_Init complete nicely and your application segfault on the
> next MPI operation.
>
> Sylvain
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel