I am seeking input before making a change to the ompi/proc/proc.c code to
resolve the referenced ticket. The change could potentially impact how the
ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
impact you, please ignore the remainder of this note.


I was asked last week to take a look at ticket #1267, filed by Tim Prins
several months ago. This ticket references a report on the devel list about
thread locks when calling comm_spawn and using MPI_Init_thread. The thread
lock is caused by the constructor/destructor in the ompi_proc_t class
explicitly removing the referenced ompi_proc_t from the static local global
ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
list operation.

As far as I can see, Tim correctly resolved the constructor conflict by
simply removing the thread lock/unlock and list append operation from the
constructor. A scan of the code shows that OBJ_NEW is -only- called from
within the ompi/proc/proc.c code, so this won't be an issue.

However, I noted several issues surrounding the creation and release of
ompi_proc_t objects that -may- cause problems in making a similar change to
the destructor to fix the rest of the threading problem. These may have been
created in response to the list modification code currently existing in the
ompi_proc_t object destructor - or they may be due to other factors.

Specifically, the MPI code base outside of ompi/proc/proc.c:

1. -never- treats the ompi_proc_t as an opal object. Instead, the code
specifically calls calloc to create space for the structures, and then
manually constructs them.

2. consistently calls OBJ_RELEASE on the resulting structures, even though
they were never created as opal objects via OBJ_NEW.

I confess to being puzzled here as the destructor will attempt to remove the
referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c,
but (since OBJ_NEW wasn't called) that object won't be on the list. Looking
at the code itself, it appears we rely on the list function to realize that
this object isn't on the list and ignore the request to remove it. We don't
check the return code from the opal_list_remove_item, and so ignore the
returned error.

My point here is to seek comment about the proposed fix for the problem
referenced in the ticket. My proposal is to remove the thread lock/unlock
and list manipulation from the ompi_proc_t destructor. From what I can see
(as described above), this should not impact the rest of the code base. I
will then add thread lock/unlock protection explicitly to ompi_proc_finalize
to protect its list operations.

It appears to me that this change would also open the way to allowing the
remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW
to correctly and consistently initialize those objects. I note that any
change to ompi_proc_t today (which has occurred in the not-too-distant
past!) can create a problem throughout the current code base due to the
manual construction of this object. This is why we have objects in the first
place - I suspect people didn't use OBJ_NEW because of the automatic change
it induced in the ompi_proc_list in ompi/proc/proc.c.

Any comments would be welcome.
Ralph


Reply via email to