I am seeking input before making a change to the ompi/proc/proc.c code to resolve the referenced ticket. The change could potentially impact how the ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't impact you, please ignore the remainder of this note.
I was asked last week to take a look at ticket #1267, filed by Tim Prins several months ago. This ticket references a report on the devel list about thread locks when calling comm_spawn and using MPI_Init_thread. The thread lock is caused by the constructor/destructor in the ompi_proc_t class explicitly removing the referenced ompi_proc_t from the static local global ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that list operation. As far as I can see, Tim correctly resolved the constructor conflict by simply removing the thread lock/unlock and list append operation from the constructor. A scan of the code shows that OBJ_NEW is -only- called from within the ompi/proc/proc.c code, so this won't be an issue. However, I noted several issues surrounding the creation and release of ompi_proc_t objects that -may- cause problems in making a similar change to the destructor to fix the rest of the threading problem. These may have been created in response to the list modification code currently existing in the ompi_proc_t object destructor - or they may be due to other factors. Specifically, the MPI code base outside of ompi/proc/proc.c: 1. -never- treats the ompi_proc_t as an opal object. Instead, the code specifically calls calloc to create space for the structures, and then manually constructs them. 2. consistently calls OBJ_RELEASE on the resulting structures, even though they were never created as opal objects via OBJ_NEW. I confess to being puzzled here as the destructor will attempt to remove the referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c, but (since OBJ_NEW wasn't called) that object won't be on the list. Looking at the code itself, it appears we rely on the list function to realize that this object isn't on the list and ignore the request to remove it. We don't check the return code from the opal_list_remove_item, and so ignore the returned error. My point here is to seek comment about the proposed fix for the problem referenced in the ticket. My proposal is to remove the thread lock/unlock and list manipulation from the ompi_proc_t destructor. From what I can see (as described above), this should not impact the rest of the code base. I will then add thread lock/unlock protection explicitly to ompi_proc_finalize to protect its list operations. It appears to me that this change would also open the way to allowing the remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW to correctly and consistently initialize those objects. I note that any change to ompi_proc_t today (which has occurred in the not-too-distant past!) can create a problem throughout the current code base due to the manual construction of this object. This is why we have objects in the first place - I suspect people didn't use OBJ_NEW because of the automatic change it induced in the ompi_proc_list in ompi/proc/proc.c. Any comments would be welcome. Ralph