Each collective is given a "signature" that is just the array of names for all 
procs involved in the collective. Thus, even though task 0 is involved in both 
of the disconnect barriers, the two collectives should be running in isolation 
from each other.
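
As an illustration only (these are not the actual ORTE/PMIx types), you can 
think of the signature roughly like this:

    /* Hypothetical sketch -- not the real data structures. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint32_t jobid;   /* which job the proc belongs to */
        uint32_t vpid;    /* rank within that job */
    } proc_name_t;

    typedef struct {
        proc_name_t *procs;   /* every proc involved in the collective */
        size_t       nprocs;
    } coll_signature_t;

    /* Two collectives are the same only if their signatures match exactly. */
    static bool signature_match(const coll_signature_t *a,
                                const coll_signature_t *b)
    {
        return a->nprocs == b->nprocs &&
               0 == memcmp(a->procs, b->procs,
                           a->nprocs * sizeof(proc_name_t));
    }

Since the two disconnect barriers involve different sets of procs 
([[8110,1],0]+[[8110,2],0] vs. [[8110,1],0]+[[8110,3],0] in the logs below), 
their signatures differ and the daemons should treat them as separate 
collectives.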

The "tags" are just receive callbacks and have no meaning other than to 
associate a particular callback to a given send/recv pair. It is the signature 
that counts as the daemons are using that to keep the various collectives 
separated.
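
Again purely as an illustration (not the actual pmix native code), a posted 
recv just maps a tag to a callback:

    /* Hypothetical sketch of a tag -> callback association. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*recv_cbfunc_t)(void *buf, size_t len, void *cbdata);

    typedef struct posted_recv {
        struct posted_recv *next;
        uint32_t            tag;     /* e.g. the "tag 5" seen in the logs */
        recv_cbfunc_t       cbfunc;  /* fired when a msg with that tag arrives */
        void               *cbdata;
    } posted_recv_t;

    static void deliver(posted_recv_t *list, uint32_t tag,
                        void *buf, size_t len)
    {
        for (posted_recv_t *r = list; NULL != r; r = r->next) {
            if (r->tag == tag) {
                r->cbfunc(buf, len, r->cbdata);  /* tag only selects the callback */
                return;
            }
        }
    }

So the fact that both fences use tag 5 is expected; the separation has to come 
from the signature, not from the tag.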

I'll have to take a look at why task 2 is leaving early. The key will be to 
look at that signature to ensure we aren't getting it confused.

On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Folks,
> 
> When I run
> mpirun -np 1 ./intercomm_create
> from the ibm test suite, it either:
> - succeeds,
> - hangs, or
> - mpirun crashes (SIGSEGV) soon after writing the following message:
> ORTE_ERROR_LOG: Not found in file
> ../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
> 
> Here is what happens.
> 
> First, the test program itself:
> task 0 spawns task 1: the intercommunicator is ab_inter on task 0 and
> parent on task 1;
> then
> task 0 spawns task 2: the intercommunicator is ac_inter on task 0 and
> parent on task 2;
> then
> several operations (merge, barrier, ...);
> and then, without any synchronization:
> - task 0 calls MPI_Comm_disconnect(ab_inter) and then
> MPI_Comm_disconnect(ac_inter)
> - tasks 1 and 2 call MPI_Comm_disconnect(parent)
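> 
> In code, the pattern is roughly the following (sketch only, not the actual
> ibm test source):
> 
>     #include <mpi.h>
> 
>     int main(int argc, char *argv[])
>     {
>         MPI_Comm parent, ab_inter, ac_inter;
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_get_parent(&parent);
> 
>         if (MPI_COMM_NULL == parent) {
>             /* task 0: spawn task 1, then task 2 */
>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
>                            MPI_COMM_SELF, &ab_inter, MPI_ERRCODES_IGNORE);
>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
>                            MPI_COMM_SELF, &ac_inter, MPI_ERRCODES_IGNORE);
>             /* ... merge, barrier, ... */
>             MPI_Comm_disconnect(&ab_inter);   /* no extra synchronization */
>             MPI_Comm_disconnect(&ac_inter);
>         } else {
>             /* tasks 1 and 2 */
>             /* ... merge, barrier, ... */
>             MPI_Comm_disconnect(&parent);
>         }
> 
>         MPI_Finalize();
>         return 0;
>     }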
> 
> I applied the attached pmix_debug.patch and ran
> mpirun -np 1 --mca pmix_base_verbose 90 ./intercomm_create
> 
> Basically, tasks 0 and 1 execute a native fence and, in parallel, tasks 0
> and 2 execute a native fence.
> They both use the *same* tags on different, though overlapping, sets of tasks.
> Bottom line: task 2 leaves the fence *before* task 0 has entered it
> (it seems task 1 told task 2 it is OK to leave the fence).
> 
> A simple workaround is to call MPI_Barrier before calling
> MPI_Comm_disconnect.
> 
> At this stage, I doubt it is even possible to get this working at the
> pmix level, so the fix might be to have MPI_Comm_disconnect invoke
> MPI_Barrier. The attached comm_disconnect.patch always calls the barrier
> before (indirectly) invoking pmix.
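> 
> i.e., at the application level the workaround is simply (comm being the
> communicator about to be disconnected):
> 
>     MPI_Barrier(comm);             /* extra synchronization */
>     MPI_Comm_disconnect(&comm);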
> 
> Could you please comment on this issue?
> 
> Cheers,
> 
> Gilles
> 
> Here are the relevant logs:
> 
> [soleil:00650] [[8110,3],0] pmix:native executing fence on 2 procs
> [[8110,1],0] and [[8110,3],0]
> [soleil:00650] [[8110,3],0]
> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post
> send to server
> [soleil:00650] [[8110,3],0] posting recv on tag 5
> [soleil:00650] [[8110,3],0] usock:send_nb: already connected to server -
> queueing for send
> [soleil:00650] [[8110,3],0] usock:send_handler called to send to server
> [soleil:00650] [[8110,3],0] usock:send_handler SENDING TO SERVER
> [soleil:00647] [[8110,2],0] pmix:native executing fence on 2 procs
> [[8110,1],0] and [[8110,2],0]
> [soleil:00647] [[8110,2],0]
> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/pmix_native.c:493] post
> send to server
> [soleil:00647] [[8110,2],0] posting recv on tag 5
> [soleil:00647] [[8110,2],0] usock:send_nb: already connected to server -
> queueing for send
> [soleil:00647] [[8110,2],0] usock:send_handler called to send to server
> [soleil:00647] [[8110,2],0] usock:send_handler SENDING TO SERVER
> [soleil:00650] [[8110,3],0] usock:recv:handler called
> [soleil:00650] [[8110,3],0] usock:recv:handler CONNECTED
> [soleil:00650] [[8110,3],0] usock:recv:handler allocate new recv msg
> [soleil:00650] usock:recv:handler read hdr
> [soleil:00650] [[8110,3],0] usock:recv:handler allocate data region of
> size 14
> [soleil:00650] [[8110,3],0] RECVD COMPLETE MESSAGE FROM SERVER OF 14
> BYTES FOR TAG 5
> [soleil:00650] [[8110,3],0]
> [../../../../../../src/ompi-trunk/opal/mca/pmix/native/usock_sendrecv.c:415]
> post msg
> [soleil:00650] [[8110,3],0] message received 14 bytes for tag 5
> [soleil:00650] [[8110,3],0] checking msg on tag 5 for tag 5
> [soleil:00650] [[8110,3],0] pmix:native recv callback activated with 14
> bytes
> [soleil:00650] [[8110,3],0] pmix:native fence released on 2 procs
> [[8110,1],0] and [[8110,3],0]
> 
> 
> <pmix_debug.patch><comm_disconnect.patch>
