I think I've found a problem that causes at least some of my MT test runs to abort or hang. The OB1 send request structure contains a req_send_range_lock that is never constructed, so the appropriate (pthread_)mutex_init call is never made for it. I've applied the following patch (given to me by Jeff) to ompi/mca/pml/ob1/pml_ob1_sendreq.c:

Index: pml_ob1_sendreq.c
===================================================================
--- pml_ob1_sendreq.c   (revision 15535)
+++ pml_ob1_sendreq.c   (working copy)
@@ -136,12 +136,18 @@
    req->req_rdma_cnt = 0;
    req->req_throttle_sends = false;
    OBJ_CONSTRUCT(&req->req_send_ranges, opal_list_t);
+    OBJ_CONSTRUCT(&req->req_send_range_lock, opal_mutex_t);
}
+static void mca_pml_ob1_send_request_destruct (mca_pml_ob1_send_request_t* req)
+{
+    OBJ_DESTRUCT(&req->req_send_range_lock);
+}
+
OBJ_CLASS_INSTANCE( mca_pml_ob1_send_request_t,
                    mca_pml_base_send_request_t,
                    mca_pml_ob1_send_request_construct,
-                    NULL );
+                    mca_pml_ob1_send_request_destruct);
/**
* Completion of a short message - nothing left to schedule. Note that this

The patch at least allows one of my tests to pass consistently (I haven't tried the other tests yet). Does this fix make sense, and could the other PMLs have similar issues?
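
For anyone who wants to see the pattern outside of OMPI, here is a minimal standalone sketch using plain pthreads. It is not the OB1 code: demo_request_t, demo_request_construct, demo_request_destruct and demo_run are hypothetical names made up for illustration, and I'm assuming the lock ultimately wraps a pthread mutex. The point is only that the constructor must initialize the mutex before anyone locks it (locking an uninitialized pthread mutex is undefined behavior, which would fit the aborts and hangs), and that the destructor should destroy it.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a request structure; not the OMPI type. */
typedef struct {
    int             rdma_cnt;
    pthread_mutex_t range_lock;   /* must be initialized before first use */
} demo_request_t;

/* Constructor: initialize every member, including the lock. */
static int demo_request_construct(demo_request_t *req)
{
    req->rdma_cnt = 0;
    return pthread_mutex_init(&req->range_lock, NULL);
}

/* Destructor: release the resources the constructor acquired. */
static int demo_request_destruct(demo_request_t *req)
{
    return pthread_mutex_destroy(&req->range_lock);
}

/* Construct, use the lock once, destruct. Returns the counter value. */
int demo_run(void)
{
    demo_request_t req;
    int rc, cnt;

    rc = demo_request_construct(&req);
    assert(rc == 0);

    pthread_mutex_lock(&req.range_lock);
    req.rdma_cnt++;               /* protected update */
    cnt = req.rdma_cnt;
    pthread_mutex_unlock(&req.range_lock);

    rc = demo_request_destruct(&req);
    assert(rc == 0);

    return cnt;
}
```

Dropping the pthread_mutex_init call from the constructor in this sketch reproduces the kind of bug the patch addresses: the lock call then operates on whatever bytes happen to be in the structure.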

Thanks,

--td
