Dear all,

I observed strange behavior with MPI_Comm_connect and MPI_Comm_disconnect. 
In short, after two processes connect to each other through a port and merge 
into an intra-communicator (ranks 0 and 1), only one of them (the root) is 
thereafter able to reach a third, new process through MPI_Comm_connect.

I can explain with an example:

1. Assume three MPI programs, each with its own MPI_COMM_WORLD (size=1, 
rank=0): process1, process2 and process3.
2. Process2 opens a port and waits in MPI_Comm_accept.
3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and creates 
an inter-communicator.
4. Process1 and process2 call MPI_Intercomm_merge and create an 
intra-communicator (say, newcomm).
5. Process3 has also opened a port and is now waiting in MPI_Comm_accept.
6. Process1 and process2 try to connect to process3 with MPI_Comm_connect(port, 
..., root, newcomm, new_all3_inter_comm).

At this stage only the root process from newcomm is able to connect to process3 
and the other one is unable to find the route. If the root is process1, then 
process2 fails and vice versa.

I have attached a tar file with a small example of this case. To observe the 
above scenario, run the example as follows:

1. Start two separate instances of "server":
        mpirun -np 1 ./server
        mpirun -np 1 ./server

2. Each server will print its port name. Copy and paste the port names into 
client.c (in strcpy).
3. Compile client.c and start the client:
        mpirun -np 1 ./client

4. During the final MPI_Comm_connect, the first server (which is process2) 
prints the following output:

[[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
[0] func:0   libopen-rte.2.dylib                 0x0000000100055afb opal_backtrace_print + 43
[1] func:1   mca_rml_oob.so                      0x000000010017aec3 rml_oob_recv_route_callback + 739
[2] func:2   mca_oob_tcp.so                      0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
[3] func:3   mca_oob_tcp.so                      0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
[4] func:4   libopen-rte.2.dylib                 0x0000000100064a55 opal_event_base_loop + 837
[5] func:5   mpirun                              0x00000001000018d1 orterun + 3428
[6] func:6   mpirun                              0x0000000100000b6b main + 27
[7] func:7   mpirun                              0x0000000100000b48 start + 52
[8] func:8   ???                                 0x0000000000000004 0x0 + 4

Note that just to make this example simple, I am not using any publish/lookup 
and I am manually copying the portnames. 

Could someone please look into this problem? We really want to use this in a 
project but are blocked by this bug.

Thanks!
Best,
Suraj

Attachment: ac-test.tar
Description: Unix tar archive
