On Wed, 2 Jun 2010, Jeff Squyres wrote:

> Don't you mean return NULL?  This function is supposed to return a (struct
> ibv_cq *).
Oops, my bad. Yes, it should return NULL. And it seems that if I make ibv_create_cq always return NULL, the scenario George described works smoothly: OMPI_ERROR is returned => bitmask cleared => connectivity problem => stop or TCP fallback. So the problem is more complicated than I thought.

But it helped me understand why I'm crashing: in my case, only a subset of processes have their create_cq fail. The others work fine, so they request a QP creation, and my process, which has fallen back to TCP, starts creating a QP ... and crashes.

If you replace:
    return NULL;
by:
    if (atoi(getenv("OMPI_COMM_WORLD_RANK")) == 26)
        return NULL;
(yes, that's ugly, but it's just to debug the problem) and run on, say, 32 processes, you should be able to reproduce the bug. Well, unless I'm mistaken again.
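
For anyone who wants to try this outside of mpirun: the rank check above dereferences a NULL pointer if OMPI_COMM_WORLD_RANK is not set. A slightly more defensive version could look like the sketch below. The helper name should_inject_cq_failure is made up for illustration and is not part of Open MPI; it only wraps the same getenv test.

```c
#include <stdlib.h>

/* Hypothetical debug helper (not an Open MPI function): returns 1 when
 * this process should pretend that ibv_create_cq() failed, 0 otherwise.
 * Assumes the launcher exports OMPI_COMM_WORLD_RANK, as in the hack above. */
static int should_inject_cq_failure(long target_rank)
{
    const char *rank_str = getenv("OMPI_COMM_WORLD_RANK");
    char *end = NULL;
    long rank;

    if (rank_str == NULL) {
        return 0;               /* not started by mpirun: never inject */
    }
    rank = strtol(rank_str, &end, 10);
    if (end == rank_str || *end != '\0') {
        return 0;               /* value is not a plain integer */
    }
    return rank == target_rank;
}
```

The debug patch then becomes "if (should_inject_cq_failure(26)) return NULL;" in place of the plain "return NULL;".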

The crash stack should look like this:
#0  0x0000003d0d605a30 in ibv_cmd_create_qp () from /usr/lib64/libibverbs.so.1
#1  0x00007f28b44e049b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
#2  0x0000003d0d609a42 in ibv_create_qp () from /usr/lib64/libibverbs.so.1
#3  0x00007f28b6be6e6e in qp_create_one () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#4  0x00007f28b6be78a4 in oob_module_start_connect () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#5  0x00007f28b6be7fbb in rml_recv_cb () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#6  0x00007f28b8c56868 in orte_rml_recv_msg_callback () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_rml_oob.so
#7  0x00007f28b8a4cf96 in mca_oob_tcp_msg_recv_complete () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#8  0x00007f28b8a4e2c2 in mca_oob_tcp_peer_recv_handler () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#9  0x00007f28b9496898 in opal_event_base_loop () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#10 0x00007f28b948ace9 in opal_progress () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#11 0x00007f28b9951ed5 in ompi_request_default_wait_all () from /home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libmpi.so.0

This new finding may change everything. Of course, stopping at the BML level still "solves" the problem, but maybe we can fix this more properly within the openib BTL. Unless this is a general out-of-band connection protocol issue.
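
To make the idea concrete, here is a rough sketch of the kind of guard I have in mind: when a connection request comes in, a process whose CQ creation failed refuses it instead of going on to create a QP. This is only an illustration under my own assumptions. struct fake_device, enum connect_result and handle_connect_request are invented names, not the actual openib BTL structures or functions, and I have not checked where such a test would really belong.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-device state kept by the BTL;
 * cq stays NULL when ibv_create_cq() failed on this process. */
struct fake_device {
    void *cq;
};

enum connect_result {
    CONNECT_OK,
    CONNECT_REJECTED
};

/* Hypothetical handler for an incoming out-of-band connection request.
 * It rejects the request when the local device has no usable CQ,
 * rather than attempting a QP creation that cannot succeed. */
static enum connect_result
handle_connect_request(const struct fake_device *dev)
{
    if (dev == NULL || dev->cq == NULL) {
        return CONNECT_REJECTED;   /* local openib setup failed earlier */
    }
    /* Real code would create the QP here (ibv_create_qp). */
    return CONNECT_OK;
}
```

How the rejection would be reported back to the peer is left open here; that part depends on the connection protocol.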

Sylvain
