What version of OMPI are you referring to?

On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
wrote:

> Dear all,
> 
> I observed a strange behavior with MPI_Comm_connect and MPI_Comm_disconnect. 
> In short, after two processes connect to each other through a port and merge to 
> create an intra-comm (rank 0 and rank 1), only one of them (the root) is 
> thereafter able to reach a third new process through MPI_Comm_connect.
> 
> I can explain with an example:
> 
> 1. Assume 3 MPI programs, each with its own MPI_COMM_WORLD of size=1 (rank=0): 
> process1, process2 and process3.
> 2. Process2 opens a port and waits in MPI_Comm_accept.
> 3. Process1 connects to process2 with MPI_Comm_connect(port,...), creating 
> an inter-comm.
> 4. Process1 and process2 participate in MPI_Intercomm_merge and create an 
> intra-comm (say, newcomm).
> 5. Process3 has also opened a port and is now waiting at MPI_Comm_accept.
> 6. Process1 and process2 try to connect to process3 with 
> MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).
> 
> At this stage, only the root process of newcomm is able to connect to 
> process3; the other one is unable to find a route. If the root is 
> process1, then process2 fails, and vice versa.
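> 
> To make the sequence concrete, here is a minimal sketch of the client side 
> (process1). This is only a hypothetical reconstruction, not the code from the 
> attached tar file; the port strings are placeholders and error checks are omitted:
> 
> #include <mpi.h>
> #include <string.h>
> 
> int main(int argc, char **argv)
> {
>     MPI_Comm inter12, newcomm, all3;
>     char port2[MPI_MAX_PORT_NAME], port3[MPI_MAX_PORT_NAME];
> 
>     MPI_Init(&argc, &argv);
>     strcpy(port2, "...");   /* port printed by process2 (placeholder) */
>     strcpy(port3, "...");   /* port printed by process3 (placeholder) */
> 
>     /* step 3: connect to process2, yielding an inter-comm */
>     MPI_Comm_connect(port2, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter12);
> 
>     /* step 4: merge into an intra-comm of size 2 (newcomm) */
>     MPI_Intercomm_merge(inter12, 0, &newcomm);
> 
>     /* step 6: collective connect over newcomm; only the root reaches
>        process3, the other rank fails with "can't find route" */
>     MPI_Comm_connect(port3, MPI_INFO_NULL, 0, newcomm, &all3);
> 
>     MPI_Finalize();
>     return 0;
> }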
> 
> I have attached a tar file with a small example of this case. To observe the 
> above scenario, run the examples as follows:
> 
> 1. start 2 separate instances of "server"
>       mpirun -np 1 ./server
>       mpirun -np 1 ./server
> 
> 2. They will print out their port names. Copy and paste the port name into 
> client.c (in the strcpy call).
> 3. Compile client.c and start the client
>       mpirun -np 1 ./client
> 
> 4. You will then see the following output from the first server (which is 
> process2) during the final MPI_Comm_connect:
> 
> [[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
> [0] func:0   libopen-rte.2.dylib                 0x0000000100055afb opal_backtrace_print + 43
> [1] func:1   mca_rml_oob.so                      0x000000010017aec3 rml_oob_recv_route_callback + 739
> [2] func:2   mca_oob_tcp.so                      0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
> [3] func:3   mca_oob_tcp.so                      0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
> [4] func:4   libopen-rte.2.dylib                 0x0000000100064a55 opal_event_base_loop + 837
> [5] func:5   mpirun                              0x00000001000018d1 orterun + 3428
> [6] func:6   mpirun                              0x0000000100000b6b main + 27
> [7] func:7   mpirun                              0x0000000100000b48 start + 52
> [8] func:8   ???                                 0x0000000000000004 0x0 + 4
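> 
> For reference, a hypothetical sketch of what the first server (process2) 
> presumably does; the actual code in ac-test.tar may differ. Process3 would 
> similarly open a port and wait in MPI_Comm_accept instead of connecting:
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv)
> {
>     char port[MPI_MAX_PORT_NAME];
>     MPI_Comm inter12, newcomm, all3;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Open_port(MPI_INFO_NULL, port);   /* step 2: open a port      */
>     printf("%s\n", port);                 /* copy this into client.c  */
>     MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter12);
> 
>     /* step 4: merge with process1 into newcomm (process2 becomes rank 1) */
>     MPI_Intercomm_merge(inter12, 1, &newcomm);
> 
>     /* step 6: non-root participant in the collective connect to process3;
>        the port argument is only significant at the root, so a dummy string
>        is passed here.  This is where "can't find route" shows up. */
>     MPI_Comm_connect("", MPI_INFO_NULL, 0, newcomm, &all3);
> 
>     MPI_Finalize();
>     return 0;
> }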
> 
> Note that, just to keep this example simple, I am not using any publish/lookup; 
> I am manually copying the port names.
> 
> Can someone please look into this problem? We really want to use this for a 
> project but are blocked by this bug.
> 
> Thanks!
> Best,
> Suraj
> 
> <ac-test.tar>

