Hi,

The reference is the v1.5 branch.

Suppose an SRQ has the following settings: S,<size>,4,2,1. Then:

1) setup_qps() sets the following:
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1

2) create_srq() sets the following:
openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - (rd_curr_num
>> 2) = 1 - 0 = 1

3) if mca_btl_openib_post_srr() is called with rd_posted=1:
rd_posted > rd_low_local is false, so the function does not return early
num_post = rd_curr_num - rd_posted = 0
the loop is not executed
wr is never set inside the loop (remains NULL)
wr->next dereferences NULL: address not mapped
         ==> SIGSEGV
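
The arithmetic in steps 1) to 3) can be checked with a small stand-alone model. This is only a sketch: the struct and function names below are hypothetical and merely mirror the field names quoted above; it does not use the real openib BTL types.

```c
#include <assert.h>

/* Hypothetical stand-in for the srq_qp fields quoted in the mail. */
struct srq_model {
    int rd_num;
    int rd_init;
    int rd_curr_num;
    int rd_low_local;
    int rd_posted;
};

/* Steps 1) and 2): derive the values from rd_num as described above. */
static struct srq_model srq_model_setup(int rd_num)
{
    struct srq_model m;

    assert(rd_num > 0);
    m.rd_num = rd_num;
    m.rd_init = rd_num / 4;                       /* step 1 */
    m.rd_curr_num = m.rd_init;                    /* step 2 */
    m.rd_low_local = m.rd_curr_num - (m.rd_curr_num >> 2);
    m.rd_posted = 0;
    return m;
}

/* Step 3): number of iterations the posting loop would run,
 * or -1 if the function would return early because
 * rd_posted > rd_low_local. */
static int srq_model_num_post(const struct srq_model *m)
{
    if (m->rd_posted > m->rd_low_local) {
        return -1;
    }
    return m->rd_curr_num - m->rd_posted;
}
```

With rd_num = 4 and rd_posted = 1, srq_model_num_post() returns 0 in this model: the threshold check passes, but the loop body would never run, so wr would keep its NULL value.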

The attached patch solves the problem by returning early when num_post
is 0, so that we only go on when the loop will actually be entered.
Could someone please have a look? The patch solves the problem with my
reproducer, but I'm not sure the fix covers all the situations.
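
On the coverage question, one way to get a first idea is to enumerate small values. The sketch below assumes that rd_low_local is always derived from rd_curr_num as in step 2) and that rd_posted stays within [0, rd_curr_num]; the function name is hypothetical, and other code paths that may change these fields are not modelled.

```c
#include <assert.h>

/* Hypothetical helper: for a given rd_curr_num, count the rd_posted
 * values in [0, rd_curr_num] that pass the threshold check
 * (rd_posted <= rd_low_local) and still yield num_post <= 0. */
static int count_empty_posts(int rd_curr_num)
{
    int rd_low_local, rd_posted, count = 0;

    assert(rd_curr_num >= 0);
    rd_low_local = rd_curr_num - (rd_curr_num >> 2);

    for (rd_posted = 0; rd_posted <= rd_curr_num; rd_posted++) {
        if (rd_posted > rd_low_local) {
            continue;               /* early-return path, loop not reached */
        }
        if (rd_curr_num - rd_posted <= 0) {
            count++;                /* loop reached with nothing to post */
        }
    }
    return count;
}
```

Under these assumptions the count is 1 for rd_curr_num from 0 to 3 (the case rd_posted == rd_curr_num) and 0 from 4 upwards. Within this model num_post is never negative, so the zero-iteration case appears limited to rd_curr_num < 4, where rd_curr_num >> 2 is 0.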

Regards,
Nadia
openib btl unsafe in case of extremely low srq settings

diff -r eb32fad15d19 ompi/mca/btl/openib/btl_openib_component.c
--- a/ompi/mca/btl/openib/btl_openib_component.c	Wed Jun 09 17:39:55 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib_component.c	Fri Jun 18 17:00:12 2010 +0200
@@ -3543,6 +3543,11 @@ int mca_btl_openib_post_srr(mca_btl_open
     }
     num_post = rd_curr_num - openib_btl->qps[qp].u.srq_qp.rd_posted;

+    if (0 == num_post) {
+        OPAL_THREAD_UNLOCK(&openib_btl->ib_lock);
+        return OMPI_SUCCESS;
+    }
+
     for(i = 0; i < num_post; i++) {
         ompi_free_list_item_t* item;
         OMPI_FREE_LIST_WAIT(&openib_btl->device->qps[qp].recv_free, item, rc);
