Ok, figured it out. There were three problems with the del_procs code:

1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
   never released the references to them (ompi_proc_all calls OBJ_RETAIN
   on every proc it returns). When calling del_procs at finalize it
   should suffice to call ompi_proc_world, which does not increment the
   reference count.
2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
   references to the procs taken by the pml_add_comm function. The fix
   is to reorder the calls to do ompi_comm_finalize, del_procs,
   pml_finalize instead of del_procs, pml_finalize, ompi_comm_finalize.
3) The check in del_procs in r2 checked for a reference count of 1. This
   is incorrect. At this point there should be 2 references: 1 from
   ompi_proc, and another from the add_procs. The fix is to change this
   check to look for 2.

This check makes me extremely uncomfortable, as nothing will call
del_procs if the reference count of a proc is not 2 when del_procs is
called. Maybe there should be an assert since this is a developer error
IMHO.

Committing a patch to fix all three of these issues.

-Nathan

On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitively not OK. It is 1) ugly and
> > > 2) break the separation barrier that we hold dear.
> >
> > Which is why I asked :)
> >
> > > Regarding your other suggestion I don’t see any reasons not to call the
> > > delete_proc on MPI_COMM_WORLD as the last action we do before tearing
> > > down everything else.
> >
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml.... I will try and track this down.
>
> /bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
>
> [1,2]<stderr>:Deleting proc 0x7b83190 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b83180 with reference count 5
> [1,2]<stderr>:Deleting proc 0x7b832b0 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b832a0 with reference count 7
> [1,2]<stderr>:Deleting proc 0x7b83360 with reference count 7
> [1,1]<stderr>:Deleting proc 0x7b833a0 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b83190 with reference count 7
> [1,0]<stderr>:Deleting proc 0x7b83300 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b833b0 with reference count 5
>
> I will track that down.
>
> -Nathan
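
To make the counting in points 1) and 3) at the top of this mail concrete, here is a toy model in C. It is only a sketch under loud assumptions: toy_proc_t and every toy_* function are made up for illustration and are not the Open MPI API. It shows why a getter that retains (as ompi_proc_all is described as doing) leaves the count too high for an "exactly 2" check, while a borrowing getter does not.

```c
#include <assert.h>

/* Hypothetical stand-in for a reference-counted proc object.
 * This is NOT the real ompi_proc_t. */
typedef struct {
    int refcount;
} toy_proc_t;

/* Toy equivalents of retaining and releasing a reference. */
void toy_retain(toy_proc_t *p)  { p->refcount++; }
void toy_release(toy_proc_t *p) { assert(p->refcount > 0); p->refcount--; }

/* The proc list itself holds the first reference. */
void toy_proc_init(toy_proc_t *p) { p->refcount = 1; }

/* add_procs takes a second reference on the proc. */
void toy_add_proc(toy_proc_t *p) { toy_retain(p); }

/* Getter that retains the proc before handing it out (assumed to
 * mirror the behaviour described for ompi_proc_all). */
toy_proc_t *toy_get_retained(toy_proc_t *p) { toy_retain(p); return p; }

/* Getter that only borrows the proc (assumed to mirror the behaviour
 * described for ompi_proc_world). */
toy_proc_t *toy_get_borrowed(toy_proc_t *p) { return p; }

/* Delete only when exactly two references remain: the proc list's and
 * the one taken by toy_add_proc. Returns 1 if the add_procs reference
 * was dropped, 0 if the proc was silently skipped. */
int toy_del_proc(toy_proc_t *p)
{
    if (p->refcount != 2) {
        return 0; /* skipped: someone else still holds a reference */
    }
    toy_release(p);
    return 1;
}
```

With this model, calling toy_del_proc(toy_get_retained(&p)) after toy_proc_init and toy_add_proc returns 0, because the getter's extra reference pushes the count to 3. The same call through toy_get_borrowed returns 1. The silent return 0 is the spot where an assert might be preferable.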