On May 28, 2014, at 4:31 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On May 27, 2014, at 9:11 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> in the case of intercomm_create, the children free all the communicators, 
>> then call MPI_Comm_disconnect() and MPI_Finalize(), and exit.
>> the parent only calls MPI_Comm_disconnect() without freeing all the 
>> communicators. MPI_Finalize() then tries to disconnect and communicate with 
>> already exited processes.
>> 
>> my understanding is that there are two ways of seeing things :
>> a) the "R-way" : the problem is the parent should not try to communicate to 
>> already exited processes
>> b) the "J-way" : the problem is the children should have waited either in 
>> MPI_Comm_free() or MPI_Finalize()
> 
> I didn't ignore Ralph's email;

I was just pulling your chain, Jeff :-)

> I was pointing out what the MPI semantics are supposed to be.
> 
> I had only a short time this morning to look at the intercomm_create test 
> program, and it looks like Gilles might be correct -- the children are 
> freeing all relevant communicators *but the parent is not*.  If this is, 
> indeed, correct, then a) OMPI's implementation might be fine because the test 
> program is erroneous (i.e., the children *think* that they are disconnected 
> and therefore allow themselves to exit, but the parents *think* that they are 
> still connected and therefore try to contact the children during the parents' 
> MPI_FINALIZE), and b) his original patch to the test program could well be 
> correct.

Agreed - however, I find it concerning that loop_spawn, in which every process 
does call MPI_Comm_free, shows the same symptom when the parent calls 
MPI_Finalize.

> 
> I won't have time to investigate this today; if someone else could look at 
> the test code and confirm whether this is correct or not, that would be 
> appreciated.
> 
>> as far as i am concerned, i have no opinion on which of a) or b) is the 
>> correct/most appropriate approach.
> 
> To be totally clear: MPI says it is erroneous for only some (not all) 
> processes in a communicator to call MPI_COMM_FREE.  So if that's the real 
> problem, then the discussion about why the parent(s) is(are) trying to 
> contact the children is moot -- the test is erroneous, and erroneous 
> application behavior is undefined.
> 
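
For anyone following along, the rule Jeff describes can be written down as a 
toy check. This is only a sketch in plain C with no MPI calls; the enum, the 
function name, and the boolean array are made up for illustration and are not 
part of MPI or Open MPI.

```c
#include <stdbool.h>

/* Hypothetical classification of how a group of processes used
 * MPI_Comm_free on one shared communicator. */
typedef enum { FREE_BY_NONE, FREE_BY_SOME, FREE_BY_ALL } free_usage_t;

/* called_free[i] is true if process i called MPI_Comm_free.
 * FREE_BY_SOME is the erroneous case: only some (not all) processes
 * in the communicator made the collective call. */
free_usage_t classify_comm_free(const bool called_free[], int nprocs)
{
    int callers = 0;
    for (int i = 0; i < nprocs; i++) {
        if (called_free[i]) {
            callers++;
        }
    }
    if (callers == 0) {
        return FREE_BY_NONE;
    }
    return (callers == nprocs) ? FREE_BY_ALL : FREE_BY_SOME;
}
```

With index 0 as the parent, intercomm_create as Gilles describes it would be 
{false, true, true} and classify as FREE_BY_SOME. loop_spawn would be all 
true (FREE_BY_ALL), which is why its failure is the more concerning one.
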
> All that being said, if we want to make this error case a bit friendlier to 
> the user, that would be great (i.e., a show_help something like "It looks 
> like an MPI process is trying to contact another MPI process that has already 
> exited/called MPI_FINALIZE.  This is almost certainly an error in the 
> application...").

Also agreed, assuming we can find the right place to determine reliably that 
this is what is happening.
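
A rough sketch of the kind of check I have in mind, assuming we already track 
somewhere whether a peer is known to have exited or finalized. The struct, its 
fields, and the function name are hypothetical and not taken from the OMPI 
code base; a real version would go through show_help rather than snprintf.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-peer bookkeeping, for illustration only. */
typedef struct {
    int  rank;       /* peer's rank in the communicator */
    bool finalized;  /* true once the peer is known to have exited/finalized */
} peer_info_t;

/* Returns true if the peer may still be contacted.  Otherwise writes a
 * user-facing message (wording taken from Jeff's suggestion) into buf
 * and returns false. */
bool peer_is_reachable(const peer_info_t *peer, char *buf, size_t len)
{
    if (!peer->finalized) {
        return true;
    }
    snprintf(buf, len,
             "It looks like an MPI process is trying to contact another MPI "
             "process (rank %d) that has already exited/called MPI_FINALIZE. "
             "This is almost certainly an error in the application.",
             peer->rank);
    return false;
}
```

The open question is still where the finalized flag would get set reliably; 
the sketch takes that part for granted.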


>  But that's definitely extra bonus points, and not required.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14883.php
