Calling MPI_Comm_free is not enough, from an MPI perspective, to clean up all knowledge about the remote processes, nor to sever the links between the local and remote groups. One MUST call MPI_Comm_disconnect in order to achieve this.
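As a rough illustration of the pattern being discussed, here is a minimal sketch of the parent side of one spawn cycle, assuming an MPI installation and a hypothetical child executable named "./child". Unlike MPI_Comm_free, MPI_Comm_disconnect is collective over the intercommunicator, waits for pending communication to complete, and informs the runtime that the two groups are no longer connected:

```c
/* Parent-side sketch: spawn a child, then disconnect rather than free.
 * Assumes an MPI installation; "./child" is a hypothetical binary that
 * calls MPI_Comm_get_parent() and MPI_Comm_disconnect() on its side. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm child;
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    /* ... communicate with the child over the intercommunicator ... */

    /* Disconnect, not MPI_Comm_free: this severs the connection so the
     * runtime no longer treats the child as part of this job's connected
     * set when MPI_Finalize runs. */
    MPI_Comm_disconnect(&child);

    MPI_Finalize();
    return 0;
}
```

The child side mirrors this: obtain the intercommunicator with MPI_Comm_get_parent and call MPI_Comm_disconnect on it before MPI_Finalize. This sketch requires an MPI compiler wrapper and launcher (e.g. mpicc and mpirun) to build and run.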
Look at the code in ompi/mpi/c and see the difference between MPI_Comm_free
and MPI_Comm_disconnect. In addition to the barrier, only disconnect calls
into the DPM framework, giving it a chance to do further cleanup.

George.

On Wed, May 28, 2014 at 10:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On May 28, 2014, at 6:41 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Wed, May 28, 2014 at 9:33 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> This is definitely what happens: only some tasks call MPI_Comm_free()
>>
>> Really? I don't see how that can happen in loop_spawn - every process is
>> clearly calling comm_free. Or are you referring to the intercomm_create
>> test?
>>
> yes, i am referring to the intercomm_create test
>
> kewl - thanks
>
> about loop_spawn, i could not get any error on my single-host, single-socket
> VM. (i tried --mca btl tcp,sm,self and --mca btl tcp,self)
>
> MPI_Finalize will end up calling ompi_dpm_dyn_finalize, which causes the
> error message on the parent of intercomm_create.
> a necessary condition is ompi_comm_num_dyncomm > 1
> /* which by the way sounds odd to me, should it be 0 ? */
>
> That does sound odd
>
> which imho cannot happen if all communicators have been freed
>
> can you detail your full mpirun command line, the number of servers you are
> using, the btl involved and the ompi release that can be used to reproduce
> the issue?
>
> Running on only one server, using the current head of the svn repo. My
> cluster only has Ethernet, and I let it freely choose the BTLs (so I imagine
> the candidates are sm,self,tcp,vader). The cmd line is really trivial:
>
> mpirun -n 1 ./loop_spawn
>
> I modified loop_spawn to only run 100 iterations as I am not patient enough
> to wait for 1000, and the number of iters isn't a factor so long as it is
> greater than 1.
> When the parent calls finalize, I get one of the following emitted for
> every iteration that was done:
>
> dpm_base_disconnect_init: error -12 in isend to process 0
>
> So in other words, the parent is attempting to isend to every child that was
> spawned during the test - it thinks that every comm_spawn'd process remains
> connected to it.
>
> I'm wondering if the issue is that the parent and child are calling
> comm_free, but neither side called comm_disconnect. So unless comm_free is
> calling disconnect under-the-covers, it might explain why the parent thinks
> all the children are still present.
>
> i will try to reproduce this myself
>
> Cheers,
>
> Gilles
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14890.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14891.php