Okay, yeah - that's a known problem that has previously been discussed on the user list. It still hasn't been fixed, though it is on the short list for repair (probably in the 1.7 series).
See the bug tracker here: https://svn.open-mpi.org/trac/ompi/ticket/2904

On Sep 12, 2012, at 10:12 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:

> I am referring to v1.6.
>
> On Sep 12, 2012, at 5:27 PM, Ralph Castain wrote:
>
>> What version of ompi are you referring to?
>>
>> On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> I observed a strange behavior with MPI_Comm_connect and MPI_Comm_disconnect.
>>> In short, after two processes connect to each other through a port and merge
>>> to create an intra-comm (rank 0 and rank 1), only one of them (the root) is
>>> thereafter able to reach a third new process through MPI_Comm_connect.
>>>
>>> I can explain with an example:
>>>
>>> 1. Assume 3 MPI programs, each with a separate MPI_COMM_WORLD of size=1,
>>>    rank=0: process1, process2 and process3.
>>> 2. Process2 opens a port and waits in MPI_Comm_accept.
>>> 3. Process1 connects to process2 with MPI_Comm_connect(port, ...) and
>>>    creates an inter-comm.
>>> 4. Process1 and process2 participate in MPI_Intercomm_merge and create an
>>>    intra-comm (say, newcomm).
>>> 5. Process3 has also opened a port and is now waiting at MPI_Comm_accept.
>>> 6. Process1 and process2 try to connect to process3 with
>>>    MPI_Comm_connect(port, ..., root, newcomm, new_all3_inter_comm).
>>>
>>> At this stage, only the root process of newcomm is able to connect to
>>> process3; the other one is unable to find the route. If the root is
>>> process1, then process2 fails, and vice versa.
>>>
>>> I have attached a tar file with a small example of this case. To observe
>>> the above scenario, run the examples in the following way:
>>>
>>> 1. Start 2 separate instances of "server":
>>>    mpirun -np 1 ./server
>>>    mpirun -np 1 ./server
>>>
>>> 2. They will print out the port names. Copy and paste the port names into
>>>    client.c (in the strcpy calls).
>>> 3. Compile client.c and start the client:
>>>    mpirun -np 1 ./client
>>>
>>> 4. During the final MPI_Comm_connect you will see the following output from
>>>    the first server (which is process2):
>>>
>>> [[8119,0],0]:route_callback tried routing message from [[8119,1],0] to [[8117,1],0]:16, can't find route
>>> [0] func:0 libopen-rte.2.dylib  0x0000000100055afb opal_backtrace_print + 43
>>> [1] func:1 mca_rml_oob.so       0x000000010017aec3 rml_oob_recv_route_callback + 739
>>> [2] func:2 mca_oob_tcp.so       0x0000000100187ab9 mca_oob_tcp_msg_recv_complete + 825
>>> [3] func:3 mca_oob_tcp.so       0x0000000100188ddd mca_oob_tcp_peer_recv_handler + 397
>>> [4] func:4 libopen-rte.2.dylib  0x0000000100064a55 opal_event_base_loop + 837
>>> [5] func:5 mpirun               0x00000001000018d1 orterun + 3428
>>> [6] func:6 mpirun               0x0000000100000b6b main + 27
>>> [7] func:7 mpirun               0x0000000100000b48 start + 52
>>> [8] func:8 ???                  0x0000000000000004 0x0 + 4
>>>
>>> Note that just to keep this example simple, I am not using any
>>> publish/lookup and I am copying the port names manually.
>>>
>>> Can someone please look into this problem? We really want to use this for a
>>> project but are held back by this bug.
>>>
>>> Thanks!
>>> Best,
>>> Suraj
>>>
>>> <ac-test.tar>
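
For readers who don't have the attached tarball, here is a minimal sketch of the client side (process1) of the sequence described above. It only illustrates the call pattern, not Suraj's actual code: the names port_to_p2, port_to_p3, inter12, newcomm and all3 are placeholders, and the port strings are pasted in by hand exactly as in the attached example. The second MPI_Comm_connect is the call where the non-root rank reports "can't find route".

/* client.c (process1), sketch: connect to process2, merge the resulting
 * inter-comm into an intra-comm, then have both members of that intra-comm
 * connect collectively to process3. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port_to_p2[MPI_MAX_PORT_NAME];   /* port printed by the first server  */
    char port_to_p3[MPI_MAX_PORT_NAME];   /* port printed by the second server */
    MPI_Comm inter12, newcomm, all3;

    MPI_Init(&argc, &argv);

    strcpy(port_to_p2, "...");            /* paste the first server's port here  */
    strcpy(port_to_p3, "...");            /* paste the second server's port here */

    /* step 3: connect to process2, creating an inter-comm */
    MPI_Comm_connect(port_to_p2, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter12);

    /* step 4: merge into an intra-comm of size 2 (ranks 0 and 1) */
    MPI_Intercomm_merge(inter12, 0, &newcomm);

    /* step 6: both members of newcomm connect to process3 collectively;
     * this is where only the root succeeds and the other rank fails */
    MPI_Comm_connect(port_to_p3, MPI_INFO_NULL, 0, newcomm, &all3);

    MPI_Comm_disconnect(&all3);
    MPI_Comm_free(&newcomm);
    MPI_Comm_free(&inter12);
    MPI_Finalize();
    return 0;
}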