Dear all,

I observed a strange behavior with MPI_Comm_connect and MPI_Comm_disconnect. In short, after two processes connect to each other through a port and merge to create an intra-comm (rank 0 and rank 1), only one of them (the root) is thereafter able to reach a third, new process through MPI_Comm_connect.
I can explain with an example:

1. Assume 3 MPI programs, each with its own MPI_COMM_WORLD of size=1, rank=0: process1, process2 and process3.
2. Process2 opens a port and waits in MPI_Comm_accept.
3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and creates an inter-comm.
4. Process1 and process2 participate in MPI_Intercomm_merge and create an intra-comm (say, newcomm).
5. Process3 has also opened a port and is now waiting in MPI_Comm_accept.
6. Process1 and process2 try to connect to process3 with MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).

At this stage only the root process of newcomm is able to connect to process3; the other one is unable to find the route. If the root is process1, then process2 fails, and vice versa.

I have attached a tar file with a small example of this case. To observe the above scenario, run the examples in the following way:

1. Start 2 separate instances of "server":
     mpirun -np 1 ./server
     mpirun -np 1 ./server
2. They will print out the portname. Copy and paste the portname into client.c (in strcpy).
3. Compile client.c and start the client:
     mpirun -np 1 ./client
4. You will see the following output from the first server (which is process2) during the final MPI_Comm_connect:

[[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
[0] func:0 libopen-rte.2.dylib 0x0000000100055afb opal_backtrace_print + 43
[1] func:1 mca_rml_oob.so 0x000000010017aec3 rml_oob_recv_route_callback + 739
[2] func:2 mca_oob_tcp.so 0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
[3] func:3 mca_oob_tcp.so 0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
[4] func:4 libopen-rte.2.dylib 0x0000000100064a55 opal_event_base_loop + 837
[5] func:5 mpirun 0x00000001000018d1 orterun + 3428
[6] func:6 mpirun 0x0000000100000b6b main + 27
[7] func:7 mpirun 0x0000000100000b48 start + 52
[8] func:8 ??? 0x0000000000000004 0x0 + 4

Note that, just to keep this example simple, I am not using any publish/lookup and I am manually copying the portnames.

Can someone please look into this problem? We really want to use this for a project but are blocked by this bug.

Thanks!

Best,
Suraj
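
P.S. For readers who do not want to unpack the archive, the client side of steps 3, 4 and 6 can be sketched roughly as below. This is a simplified illustration, not the attached client.c verbatim. The two port strings are placeholders for the names printed by the servers, the choice of root 0 and of high=0 in the merge are assumptions, and error checking is omitted. Process2 has to make the matching MPI_Intercomm_merge and MPI_Comm_connect calls on its side, since both are collective.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port2[MPI_MAX_PORT_NAME];   /* port opened by process2 */
    char port3[MPI_MAX_PORT_NAME];   /* port opened by process3 */
    MPI_Comm inter12, newcomm, new_all3_inter_comm;

    MPI_Init(&argc, &argv);

    /* Placeholders: paste the port names printed by the two servers. */
    strcpy(port2, "PORT_NAME_PRINTED_BY_FIRST_SERVER");
    strcpy(port3, "PORT_NAME_PRINTED_BY_SECOND_SERVER");

    /* Step 3: connect to process2, which is waiting in MPI_Comm_accept. */
    MPI_Comm_connect(port2, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter12);

    /* Step 4: merge the inter-comm into an intra-comm of size 2.
     * Assumption: the server passes high=1, so this client gets rank 0. */
    MPI_Intercomm_merge(inter12, 0, &newcomm);

    /* Step 6: collective connect over newcomm to process3.
     * The port name is only significant at the root (assumed rank 0). */
    MPI_Comm_connect(port3, MPI_INFO_NULL, 0, newcomm, &new_all3_inter_comm);

    printf("client: connected to process3 through newcomm\n");

    /* Tear down in reverse order of creation. */
    MPI_Comm_disconnect(&new_all3_inter_comm);
    MPI_Comm_free(&newcomm);
    MPI_Comm_disconnect(&inter12);
    MPI_Finalize();
    return 0;
}
```

With this layout, the failure described above shows up in the second MPI_Comm_connect, on whichever member of newcomm is not the root.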
ac-test.tar
Description: Unix tar archive