On May 27, 2014, at 6:11 PM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

> Ralph,
> 
> in the case of intercomm_create, the children free all the communicators, 
> then call MPI_Comm_disconnect() and MPI_Finalize(), and exit.
> the parent only calls MPI_Comm_disconnect() without freeing all the 
> communicators, so its MPI_Finalize() tries to disconnect from and communicate 
> with already exited processes.
> 
> my understanding is that there are two ways of seeing things :
> a) the "R-way" : the problem is the parent should not try to communicate with 
> already exited processes
> b) the "J-way" : the problem is the children should have waited either in 
> MPI_Comm_free() or MPI_Finalize()

I don't think you can use option (b) - we can't have the children lingering 
around for the parent to call finalize, if I'm understanding you correctly.

When I look at loop_spawn, I see this being done by the parent on every 
iteration:

    MPI_Init(&argc, &argv);

    /* niter is a placeholder: the iteration count is not shown in this excerpt */
    for (iter = 0; iter < niter; ++iter) {
        MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &comm, &err);
        printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);

        MPI_Intercomm_merge(comm, 0, &merged);
        MPI_Comm_rank(merged, &rank);
        MPI_Comm_size(merged, &size);
        printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
               iter, rank, size);
        MPI_Comm_free(&merged);
    }
    MPI_Finalize();


The child does:

    MPI_Init(&argc, &argv);   
    MPI_Comm_get_parent(&parent);   
    MPI_Intercomm_merge(parent, 1, &merged);
    MPI_Comm_rank(merged, &rank);
    MPI_Comm_size(merged, &size);
    printf("Child merged rank = %d, size = %d\n", rank, size);
   
    MPI_Comm_free(&merged);
    MPI_Finalize();


So it looks to me like there is either something missing, or a bug in Comm_free 
that isn't removing the child from the parent's field of view.
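
To make the "field of view" bookkeeping concrete, here is a toy model in plain C. It is only a sketch, not Open MPI code: proc_view, comm_created, comm_released, is_connected and peers_to_finalize_with are names made up for illustration. It assumes that connectivity can be modeled as a per-peer count of live communicators. In the parent loop as excerpted above, two communicators are created with each child (comm from MPI_Comm_spawn and merged from MPI_Intercomm_merge) but only merged is freed, so in this model the child never leaves the parent's view:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not Open MPI internals. */
#define MAX_PEERS 8

typedef struct {
    /* shared_comms[p] = number of live communicators that still include peer p */
    int shared_comms[MAX_PEERS];
} proc_view;

/* Record that a communicator spanning `peer` was created (spawn, merge, ...). */
void comm_created(proc_view *v, int peer)
{
    assert(peer >= 0 && peer < MAX_PEERS);
    v->shared_comms[peer]++;
}

/* Record that one communicator spanning `peer` was freed or disconnected. */
void comm_released(proc_view *v, int peer)
{
    assert(peer >= 0 && peer < MAX_PEERS);
    assert(v->shared_comms[peer] > 0);
    v->shared_comms[peer]--;
}

/* A peer counts as "connected" while at least one shared communicator remains. */
bool is_connected(const proc_view *v, int peer)
{
    assert(peer >= 0 && peer < MAX_PEERS);
    return v->shared_comms[peer] > 0;
}

/* Number of peers that a finalize step would still have to coordinate with. */
int peers_to_finalize_with(const proc_view *v)
{
    int n = 0;
    for (int p = 0; p < MAX_PEERS; ++p) {
        if (v->shared_comms[p] > 0) {
            n++;
        }
    }
    return n;
}
```

Replaying one loop iteration against a zero-initialized proc_view (two comm_created calls for the child, one comm_released for merged) leaves is_connected true for that child. A second comm_released, standing in for freeing or disconnecting comm, is what drops it from the view in this model.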


> 
> i did not investigate the loop_spawn test yet, and will do today.
> 
> as far as i am concerned, i have no opinion on which of a) or b) is the 
> correct/most appropriate approach.
> 
> Cheers,
> 
> Gilles
> 
> 
> On Wed, May 28, 2014 at 9:46 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Since you ignored my response, I'll reiterate and clarify it here. The 
> problem in the case of loop_spawn is that the parent process remains 
> "connected" to children after the child has finalized and died. Hence, when 
> the parent attempts to finalize, it tries to "disconnect" itself from 
> processes that no longer exist - and that is what generates the error message.
> 
> So the issue in that case appears to be that "finalize" is not marking the 
> child process as "disconnected", thus leaving the parent thinking that it 
> needs to disconnect when it finally ends.
> 
> 
> On May 27, 2014, at 5:33 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> > Note that MPI says that COMM_DISCONNECT simply disconnects that individual 
> > communicator.  It does *not* guarantee that the processes involved will be 
> > fully disconnected.
> >
> > So I think that the freeing of communicators is good app behavior, but it 
> > is not required by the MPI spec.
> >
> > If OMPI is requiring this for correct termination, then something is wrong. 
> >  MPI_FINALIZE is supposed to be collective across all connected MPI procs 
> > -- and if the parent and spawned procs in this test are still connected 
> > (because they have not disconnected all communicators between them), the 
> > FINALIZE is supposed to be collective across all of them.
> >
> > This means that FINALIZE is allowed to block if it needs to, such that OMPI 
> > sending control messages to procs that are still "connected" (in the MPI 
> > sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
> >
> >
> >
> >
> > On May 27, 2014, at 2:27 AM, Gilles Gouaillardet 
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> currently, the dynamic/intercomm_create test from the ibm test suite 
> >> outputs the following message :
> >>
> >> dpm_base_disconnect_init: error -12 in isend to process 1
> >>
> >> the root cause is that task 0 tries to send messages to already exited tasks.
> >>
> >> one way of seeing things is that this is an application issue :
> >> task 0 should have MPI_Comm_free'd all its communicators before calling 
> >> MPI_Comm_disconnect.
> >> This can be achieved via the attached patch
> >>
> >> another way of seeing things is that this is a bug in OpenMPI.
> >> In this case, what would be the right approach ?
> >> - automatically free communicators (if needed) when MPI_Comm_disconnect is 
> >> invoked ?
> >> - simply remove communicators (if needed) from ompi_mpi_communicators when 
> >> MPI_Comm_disconnect is invoked ?
> >>  /* this causes a memory leak, but the application can be seen as 
> >> responsible for it */
> >> - other ?
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> <intercomm_create.patch>_______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14847.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/05/14875.php
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14876.php
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14877.php
