In the meantime, while I'm waiting for some contacts at QLogic to get back to me, does anyone on the list have any idea what we might be running up against here?
Our setup:

  Open MPI 1.5.3 w/ PSM key patch
  Slurm 2.2.4
  OFED 1.5.2
  RHEL 5.6 (stock kernel)

You can see this mailing list thread for details:

  http://www.open-mpi.org/community/lists/users/2011/04/16394.php

but here's the meat...

  mpirun -n 2 -host node1 <prog>

and my <prog> runs just fine. However, if I do

  srun --resv-ports -n 2 -w node1 <prog>

I get

  [node1:16780] PSM EP connect error (unknown connect error):
  [node1:16780] node1
  [node1:16780] PSM EP connect error (Endpoint could not be reached):
  [node1:16780] node1
  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

I did notice a difference in the precondition env variable between the two runs.

  mpirun -n 2 -host node1 <prog>

sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which changes with each run, i.e. it looks random), while

  srun --resv-ports -n 2 -w node1 <prog>

sets precondition_transports=0000184500000000-0000000100000000 (which doesn't seem to change from run to run).

Can anyone decode the PSM EP and PML errors? And is there any rhyme or reason to the PSM key, i.e. does it need to be random in some way, or is it fine for it to take just a few fixed values?
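
For anyone who wants to compare the two keys by eye, here is a small Python sketch that splits each precondition_transports value into its two 64-bit hex words and then into 32-bit halves. The helper names (split_key, halves) are made up for illustration, and the 32-bit split is only my guess at a useful way to view the value, not something taken from the Open MPI or PSM sources.

```python
# Sketch only: split_key and halves are hypothetical helpers, and viewing
# the key as 32-bit fields is an assumption, not the documented layout.

def split_key(key):
    """Split 'XXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXX' into two 64-bit integers."""
    first, second = key.split("-")
    if len(first) != 16 or len(second) != 16:
        raise ValueError("expected two 16-digit hex words: %r" % key)
    return int(first, 16), int(second, 16)


def halves(word):
    """Return the (upper, lower) 32-bit halves of a 64-bit integer."""
    return word >> 32, word & 0xFFFFFFFF


# The two values reported above.
mpirun_key = "fbc383997ee1b668-00d40f1401d2e827"
srun_key = "0000184500000000-0000000100000000"

for label, key in (("mpirun", mpirun_key), ("srun", srun_key)):
    first, second = split_key(key)
    print(label, halves(first), halves(second))
```

Viewed this way, the srun key has only two non-zero fields, 0x1845 (6213 decimal) and 1, whereas every field of the mpirun key is populated. Small integers like that might come from some kind of job or step identifier, but that is speculation on my part.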
