Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX. What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1

For some reason, with *Open-MX*, process 0 seems to start (3) before process 1 has finished (2). This probably causes a deadlock because p1 is polling on rail 0 for (2), while (3) needs somebody to poll on rail 1 for the connect handshake.

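
To make the suspected interleaving concrete, here is a toy Python model of it. This is only a sketch, not OMPI or Open-MX code: the `Proc` class and `simulate` function are made up for illustration. It assumes a connect completes only once the peer polls the rail the request arrived on, and that a process blocked in a connect polls only the rail it is connecting on (unless something like a progression thread polls every rail).

```python
from collections import deque


class Proc:
    """Hypothetical process blocked in one connect, with two rails."""

    def __init__(self, name, wait_rail):
        self.name = name
        self.wait_rail = wait_rail  # rail on which we await our connect reply
        self.requests = {0: deque(), 1: deque()}  # incoming connect requests
        self.replies = {0: deque(), 1: deque()}   # incoming connect replies
        self.connected = False

    def poll(self, rail, peer):
        # Answer every connect request that arrived on this rail.
        while self.requests[rail]:
            self.requests[rail].popleft()
            peer.replies[rail].append("ack")
        # Our own connect completes only when its reply shows up.
        if rail == self.wait_rail and self.replies[rail]:
            self.replies[rail].popleft()
            self.connected = True


def simulate(poll_all_rails, max_rounds=10):
    """Return True if both pending connects complete, False otherwise."""
    p0 = Proc("p0", wait_rail=1)  # p0 has already started step (3)
    p1 = Proc("p1", wait_rail=0)  # p1 is still inside step (2)
    p1.requests[1].append("connect")  # p0's rail 1 request reaches p1
    p0.requests[0].append("connect")  # p1's rail 0 request reaches p0
    for _ in range(max_rounds):
        for me, peer in ((p0, p1), (p1, p0)):
            # Assumption: without a progression thread, a process polls
            # only the rail of the connect it is currently blocked in.
            rails = (0, 1) if poll_all_rails else (me.wait_rail,)
            for rail in rails:
                me.poll(rail, peer)
        if p0.connected and p1.connected:
            return True
    return False
```

In this model, `simulate(poll_all_rails=False)` never completes either connect, because each request sits on a rail that nobody polls, while `simulate(poll_all_rails=True)` completes both within two rounds.
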
So, the question is: is there anything in OMPI (1.3) guaranteeing that the above 4 steps will occur in some specified order? If so, Open-MX is probably doing something wrong that breaks the order. If not, adding a progression thread to Open-MX might be the only solution...

thanks,
Brice