OK, figured it out. There were three problems with the del_procs code:

 1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
    never released the reference to them (ompi_proc_all called
    OBJ_RETAIN on all the procs returned). When calling del_procs at
    finalize it should suffice to call ompi_proc_world which does not
    increment the reference count.

 2) del_procs is called BEFORE ompi_comm_finalize, so the references
    to the procs taken by the pml_add_comm function are still held. The
    fix is to reorder the calls to do ompi_comm_finalize, del_procs,
    pml_finalize instead of del_procs, pml_finalize,
    ompi_comm_finalize.

 3) The check in del_procs in r2 checked for a reference count of
    1. This is incorrect. At this point there should be 2 references: 1
    from ompi_proc, and another from the add_procs. The fix is to change
    this check to look for 2. This check makes me extremely uncomfortable
    as the del_procs call will not be passed on if the reference count
    of a proc is not 2 when del_procs is called. Maybe there should be
    an assert since this is a developer error IMHO.
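
To make the counting concrete, here is a rough toy model of the three
points above. It is NOT the Open MPI code: toy_proc_t, toy_del_procs,
toy_finalize and the two flags are made-up names for illustration, and
the only assumption is the one stated in the list (1 reference from the
proc list, 1 from add_procs, 1 more while a communicator is alive, and 1
more if the accessor retains).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPROCS 3
#define TOY_EXPECTED_REFS 2   /* 1 from the proc list + 1 from add_procs */

/* Hypothetical stand-in for a reference-counted proc object. */
typedef struct {
    int refcount;
    int deleted;
} toy_proc_t;

static void toy_retain(toy_proc_t *p)  { p->refcount++; }
static void toy_release(toy_proc_t *p) { p->refcount--; }

/* Apply fn to every proc in the array. */
static void toy_for_all(toy_proc_t *procs, void (*fn)(toy_proc_t *))
{
    for (size_t i = 0; i < TOY_NPROCS; i++) {
        fn(&procs[i]);
    }
}

/* Models the check from item 3: a proc is only torn down when it holds
 * exactly the expected number of references. Procs with any other count
 * are silently skipped (an assert could go here instead). */
static size_t toy_del_procs(toy_proc_t *procs)
{
    size_t deleted = 0;
    for (size_t i = 0; i < TOY_NPROCS; i++) {
        if (procs[i].refcount != TOY_EXPECTED_REFS) {
            continue;
        }
        toy_release(&procs[i]);   /* drop the add_procs reference */
        procs[i].deleted = 1;
        deleted++;
    }
    return deleted;
}

/* Runs a toy finalize and returns how many procs del_procs tore down.
 * comm_finalize_first: release the communicator references before
 *                      del_procs (item 2).
 * accessor_retains:    the proc list accessor retains every proc it
 *                      returns and nobody releases them (item 1). */
static size_t toy_finalize(int comm_finalize_first, int accessor_retains)
{
    toy_proc_t procs[TOY_NPROCS] = {{0, 0}};

    toy_for_all(procs, toy_retain);       /* reference from the proc list */
    toy_for_all(procs, toy_retain);       /* reference from add_procs */
    toy_for_all(procs, toy_retain);       /* reference from add_comm */

    if (comm_finalize_first) {
        toy_for_all(procs, toy_release);  /* comm finalize runs first */
    }
    if (accessor_retains) {
        toy_for_all(procs, toy_retain);   /* leaked accessor reference */
    }

    size_t deleted = toy_del_procs(procs);

    if (!comm_finalize_first) {
        toy_for_all(procs, toy_release);  /* comm finalize runs too late */
    }
    return deleted;
}
```

In this model toy_finalize(1, 0) tears down all 3 procs, while any
other combination of the two flags tears down none of them, which is the
"call never reaches the btl" symptom from earlier in the thread.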

Committing a patch to fix all three of these issues.

-Nathan

On Thu, May 15, 2014 at 11:52:27AM -0600, Nathan Hjelm wrote:
> On Thu, May 15, 2014 at 11:44:05AM -0600, Nathan Hjelm wrote:
> > On Thu, May 15, 2014 at 01:33:31PM -0400, George Bosilca wrote:
> > > The solution you propose here is definitely not OK. It 1) is ugly and
> > > 2) breaks the separation barrier that we hold dear.
> > 
> > Which is why I asked :)
> > 
> > > Regarding your other suggestion I don’t see any reason not to call the
> > > delete_proc on MPI_COMM_WORLD as the last action we do before tearing
> > > down everything else.
> > 
> > I spoke too soon. It looks like we *are* calling del_procs but I am not
> > seeing the call reach the bml.... I will try and track this down.
> 
> /bml/btl/ .. I see what is happening. The proc reference counts are all
> larger than 1 when we call del_procs:
> 
> 
> [1,2]<stderr>:Deleting proc 0x7b83190 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b83180 with reference count 5
> [1,2]<stderr>:Deleting proc 0x7b832b0 with reference count 5
> [1,1]<stderr>:Deleting proc 0x7b832a0 with reference count 7
> [1,2]<stderr>:Deleting proc 0x7b83360 with reference count 7
> [1,1]<stderr>:Deleting proc 0x7b833a0 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b83190 with reference count 7
> [1,0]<stderr>:Deleting proc 0x7b83300 with reference count 5
> [1,0]<stderr>:Deleting proc 0x7b833b0 with reference count 5
> 
> 
> I will track that down.
> 
> -Nathan



> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14812.php