On May 27, 2014, at 6:11 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
>
> In the case of intercomm_create, the children free all the communicators,
> then call MPI_Comm_disconnect(), then MPI_Finalize(), and exit.
> The parent only calls MPI_Comm_disconnect() without freeing all the
> communicators, and MPI_Finalize() then tries to disconnect from and
> communicate with already-exited processes.
>
> My understanding is that there are two ways of seeing things:
> a) the "R-way": the problem is that the parent should not try to communicate
>    with already-exited processes
> b) the "J-way": the problem is that the children should have waited, either
>    in MPI_Comm_free() or in MPI_Finalize()

I don't think you can use option (b) - we can't have the children lingering
around for the parent to call finalize, if I'm understanding you correctly.

When I look at loop_spawn, I see this being done by the parent on every
iteration:

    MPI_Init(&argc, &argv);
    loop() {
        MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &comm, &err);
        printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);

        MPI_Intercomm_merge(comm, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n", iter, rank, size);

        MPI_Comm_free(&merged);
    }
    MPI_Finalize();

The child does:

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("Child merged rank = %d, size = %d\n", rank, size);
    MPI_Comm_free(&merged);
    MPI_Finalize();

So it looks to me like there is either something missing, or a bug in
Comm_free that isn't removing the child from the parent's field of view.

> I did not investigate the loop_spawn test yet, and will do that today.
>
> As far as I am concerned, I have no opinion on which of a) or b) is the
> correct/most appropriate approach.
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 28, 2014 at 9:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Since you ignored my response, I'll reiterate and clarify it here. The
> problem in the case of loop_spawn is that the parent process remains
> "connected" to children after the child has finalized and died. Hence, when
> the parent attempts to finalize, it tries to "disconnect" itself from
> processes that no longer exist - and that is what generates the error
> message.
>
> So the issue in that case appears to be that "finalize" is not marking the
> child process as "disconnected", thus leaving the parent thinking that it
> needs to disconnect when it finally ends.
>
>
> On May 27, 2014, at 5:33 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> > Note that MPI says that COMM_DISCONNECT simply disconnects that individual
> > communicator. It does *not* guarantee that the processes involved will be
> > fully disconnected.
> >
> > So I think that the freeing of communicators is good app behavior, but it
> > is not required by the MPI spec.
> >
> > If OMPI is requiring this for correct termination, then something is
> > wrong. MPI_FINALIZE is supposed to be collective across all connected MPI
> > procs -- and if the parent and spawned procs in this test are still
> > connected (because they have not disconnected all communicators between
> > them), the FINALIZE is supposed to be collective across all of them.
> >
> > This means that FINALIZE is allowed to block if it needs to, such that
> > OMPI sending control messages to procs that are still "connected" (in the
> > MPI sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
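For concreteness, a minimal compilable sketch of the parent-side pattern
discussed above, with the spawn intercommunicator explicitly disconnected on
every iteration so that the parent is no longer MPI-connected to exited
children when it finalizes. This is only an illustration: the iteration count
and the "./loop_child" binary name are placeholders, not taken from the actual
loop_spawn sources.

    /* parent.c -- hypothetical sketch, not the ibm/dynamic/loop_spawn test */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm comm, merged;
        int rank, size, err, iter;

        MPI_Init(&argc, &argv);

        for (iter = 0; iter < 10; iter++) {          /* placeholder count */
            err = MPI_Comm_spawn("./loop_child",     /* placeholder binary */
                                 MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                                 MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
            printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);

            MPI_Intercomm_merge(comm, 0, &merged);
            MPI_Comm_rank(merged, &rank);
            MPI_Comm_size(merged, &size);
            printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
                   iter, rank, size);

            MPI_Comm_free(&merged);        /* release the merged intracomm */
            MPI_Comm_disconnect(&comm);    /* sever the connection to the
                                              children, so finalize has no
                                              exited procs left to contact */
        }

        MPI_Finalize();
        return 0;
    }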
>
> > On May 27, 2014, at 2:27 AM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> Currently, the dynamic/intercomm_create test from the ibm test suite
> >> outputs the following message:
> >>
> >>     dpm_base_disconnect_init: error -12 in isend to process 1
> >>
> >> The root cause is that task 0 tries to send messages to already-exited
> >> tasks.
> >>
> >> One way of seeing things is that this is an application issue:
> >> task 0 should have MPI_Comm_free'd all its communicators before calling
> >> MPI_Comm_disconnect. This can be achieved via the attached patch.
> >>
> >> Another way of seeing things is that this is a bug in OpenMPI.
> >> In this case, what would be the right approach?
> >> - automatically free communicators (if needed) when MPI_Comm_disconnect
> >>   is invoked?
> >> - simply remove communicators (if needed) from ompi_mpi_communicators
> >>   when MPI_Comm_disconnect is invoked?
> >>   /* this causes a memory leak, but the application can be seen as
> >>      responsible for it */
> >> - other?
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> <intercomm_create.patch>
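For illustration only, a hypothetical child-side sketch of the
"free derived communicators, then disconnect, then finalize" ordering that the
application-side option above implies. This is not the ibm test source, and
the variable names are placeholders.

    /* child.c -- hypothetical sketch of the spawned child's teardown order */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, merged;
        int rank, size;

        MPI_Init(&argc, &argv);

        MPI_Comm_get_parent(&parent);
        MPI_Intercomm_merge(parent, 1, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("Child merged rank = %d, size = %d\n", rank, size);

        MPI_Comm_free(&merged);         /* free the communicator derived from
                                           the parent intercommunicator ...  */
        MPI_Comm_disconnect(&parent);   /* ... then disconnect from the parent
                                           before finalizing                 */
        MPI_Finalize();
        return 0;
    }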