I think I've found a problem that is causing at least some of my runs of
the MT tests to abort or hang. The OB1 request structure contains a
req_send_range_lock that is never initialized with the appropriate
(pthread_)mutex_init call, because nothing ever constructs it. I've
applied the following patch (given to me by Jeff) to
ompi/mca/pml/ob1/pml_ob1_sendreq.c:
Index: pml_ob1_sendreq.c
===================================================================
--- pml_ob1_sendreq.c (revision 15535)
+++ pml_ob1_sendreq.c (working copy)
@@ -136,12 +136,18 @@
req->req_rdma_cnt = 0;
req->req_throttle_sends = false;
OBJ_CONSTRUCT(&req->req_send_ranges, opal_list_t);
+ OBJ_CONSTRUCT(&req->req_send_range_lock, opal_mutex_t);
}
+static void mca_pml_ob1_send_request_destruct(mca_pml_ob1_send_request_t* req)
+{
+ OBJ_DESTRUCT(&req->req_send_range_lock);
+}
+
OBJ_CLASS_INSTANCE( mca_pml_ob1_send_request_t,
mca_pml_base_send_request_t,
mca_pml_ob1_send_request_construct,
- NULL );
+ mca_pml_ob1_send_request_destruct);
/**
* Completion of a short message - nothing left to schedule. Note that this
With this patch, at least one of my tests now passes consistently (I
haven't tried the other tests yet). Does this fix make sense, and could
the other PMLs have similar uninitialized locks?
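
As a standalone illustration of the pattern the patch restores, here is
a minimal sketch in plain pthreads. It does not use the OPAL object
system: demo_request_t and the demo_request_* functions are hypothetical
names made up for this example, and the real OBJ_CONSTRUCT/OBJ_DESTRUCT
machinery does more than this. The only point is that a mutex embedded in
a structure needs an explicit init in the constructor and a matching
destroy in the destructor before anyone locks it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a send request; NOT the real OB1 structure. */
typedef struct {
    pthread_mutex_t range_lock;   /* plays the role of req_send_range_lock */
    int pending_ranges;           /* state protected by range_lock */
} demo_request_t;

/* Constructor: initialize every member, including the embedded lock.
 * Returns 0 on success, or the pthread error code. */
static int demo_request_construct(demo_request_t *req)
{
    assert(req != NULL);
    req->pending_ranges = 0;
    return pthread_mutex_init(&req->range_lock, NULL);
}

/* Destructor: release the lock's resources. Must pair with construct. */
static int demo_request_destruct(demo_request_t *req)
{
    assert(req != NULL);
    return pthread_mutex_destroy(&req->range_lock);
}

/* Update the protected state; only valid after a successful construct. */
static int demo_request_add_range(demo_request_t *req)
{
    assert(req != NULL);
    int rc = pthread_mutex_lock(&req->range_lock);
    if (rc != 0) {
        return rc;
    }
    req->pending_ranges++;
    return pthread_mutex_unlock(&req->range_lock);
}
```

Calling demo_request_add_range on a structure that was never passed to
demo_request_construct would lock a mutex with indeterminate contents,
which is undefined behavior. That would be consistent with the
intermittent aborts and hangs I've been seeing, though I haven't
confirmed that this is the only cause.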
Thanks,
--td