Hi,

The reference is the v1.5 branch.

If an SRQ has the following settings: S,<size>,4,2,1

1) setup_qps() sets the following:
   mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
   mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1

2) create_srq() sets the following:
   openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
   openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - (rd_curr_num >> 2) = rd_curr_num = 1

3) if mca_btl_openib_post_srr() is called with rd_posted=1:
   rd_posted > rd_low_local is false
   num_post=rd_curr_num-rd_posted=0
   the loop is not executed
   wr is never initialized (remains NULL)
   wr->next: address not mapped
   ==> SIGSEGV

The attached patch solves the problem by returning early when num_post is 0,
so that we only go on when we'll actually enter the loop.

Can someone have a look, please? The patch solves the problem with my
reproducer, but I'm not sure the fix covers all the situations.

Regards,
Nadia

openib btl unsafe in case of extremely low srq settings

diff -r eb32fad15d19 ompi/mca/btl/openib/btl_openib_component.c
--- a/ompi/mca/btl/openib/btl_openib_component.c Wed Jun 09 17:39:55 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib_component.c Fri Jun 18 17:00:12 2010 +0200
@@ -3543,6 +3543,11 @@ int mca_btl_openib_post_srr(mca_btl_open
     }
     num_post = rd_curr_num - openib_btl->qps[qp].u.srq_qp.rd_posted;
 
+    if (0 == num_post) {
+        OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
+        return OMPI_SUCCESS;
+    }
+
     for(i = 0; i < num_post; i++) {
         ompi_free_list_item_t* item;
         OMPI_FREE_LIST_WAIT(&openib_btl->device->qps[qp].recv_free, item, rc);
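
For anyone who wants to check the numbers from steps 1) to 3) without an InfiniBand setup, here is a minimal standalone sketch of the arithmetic only. It is not the openib code: struct srq_sketch and the two srq_sketch_* helpers are hypothetical names made up for illustration, and the formulas are just the ones quoted in the message (rd_init = rd_num/4, rd_low_local = rd_curr_num - (rd_curr_num >> 2)).

```c
#include <assert.h>

/* Hypothetical stand-in for the SRQ fields named in the message;
 * this is NOT the real openib data structure. */
struct srq_sketch {
    int rd_num;        /* number of receive buffers requested          */
    int rd_init;       /* initial number, rd_num / 4 as in step 1)     */
    int rd_curr_num;   /* current number, starts at rd_init (step 2)   */
    int rd_low_local;  /* low watermark derived from rd_curr_num       */
    int rd_posted;     /* receives currently posted                    */
};

/* Steps 1) and 2): derive the values from rd_num using the
 * formulas quoted in the message. */
static struct srq_sketch srq_sketch_init(int rd_num)
{
    struct srq_sketch s;

    assert(rd_num > 0);
    s.rd_num = rd_num;
    s.rd_init = rd_num / 4;
    s.rd_curr_num = s.rd_init;
    s.rd_low_local = s.rd_curr_num - (s.rd_curr_num >> 2);
    s.rd_posted = 0;
    return s;
}

/* Step 3): return the number of iterations the posting loop would run,
 * or -1 when the "rd_posted > rd_low_local" test returns early. */
static int srq_sketch_num_post(const struct srq_sketch *s)
{
    if (s->rd_posted > s->rd_low_local) {
        return -1;
    }
    return s->rd_curr_num - s->rd_posted;
}
```

Calling srq_sketch_init(4), setting rd_posted to 1 and then calling srq_sketch_num_post() gives 0: the early-return test does not fire and the loop would run zero times, which is the case the patch guards against.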