What version of Open MPI are you referring to?

On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
> Dear all,
>
> I observed strange behavior with MPI_Comm_connect and MPI_Comm_disconnect.
> In short, after two processes connect to each other through a port and merge
> to create an intra-communicator (rank 0 and rank 1), only one of them (the
> root) is thereafter able to reach a third new process through
> MPI_Comm_connect.
>
> I can explain with an example:
>
> 1. Assume 3 MPI programs, each with a separate MPI_COMM_WORLD of size=1,
>    rank=0: process1, process2, and process3.
> 2. Process2 opens a port and waits in MPI_Comm_accept.
> 3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and
>    creates an inter-communicator.
> 4. Process1 and process2 participate in MPI_Intercomm_merge and create an
>    intra-communicator (say, newcomm).
> 5. Process3 has also opened a port and is now waiting in MPI_Comm_accept.
> 6. Process1 and process2 try to connect to process3 with
>    MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).
>
> At this stage, only the root process of newcomm is able to connect to
> process3; the other one is unable to find the route. If the root is
> process1, then process2 fails, and vice versa.
>
> I have attached a tar file with a small example of this case. To observe
> the above scenario, run the examples in the following way:
>
> 1. Start 2 separate instances of "server":
>    mpirun -np 1 ./server
>    mpirun -np 1 ./server
> 2. They will print out the port names. Copy and paste the port names into
>    client.c (in strcpy).
> 3. Compile client.c and start the client:
>    mpirun -np 1 ./client
> 4. You will see the following output from the first server (which is
>    process2) during the final MPI_Comm_connect:
>
> [[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
> [0] func:0 libopen-rte.2.dylib 0x0000000100055afb opal_backtrace_print + 43
> [1] func:1 mca_rml_oob.so 0x000000010017aec3 rml_oob_recv_route_callback + 739
> [2] func:2 mca_oob_tcp.so 0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
> [3] func:3 mca_oob_tcp.so 0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
> [4] func:4 libopen-rte.2.dylib 0x0000000100064a55 opal_event_base_loop + 837
> [5] func:5 mpirun 0x00000001000018d1 orterun + 3428
> [6] func:6 mpirun 0x0000000100000b6b main + 27
> [7] func:7 mpirun 0x0000000100000b48 start + 52
> [8] func:8 ??? 0x0000000000000004 0x0 + 4
>
> Note that, just to keep this example simple, I am not using any
> publish/lookup; I am copying the port names manually.
>
> Can someone please look into this problem? We really want to use this for a
> project but are blocked by this bug.
>
> Thanks!
> Best,
> Suraj
>
> <ac-test.tar>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel