Hi!

While running the IMB 3.1 with OpenMPI 1.2.7 over MX 1.2.7, I see most collective operations hang when testing with 16K buffers on 128 nodes, one MPI rank per node, with the default settings (launch command sketched below):

works:  PingPong, PingPing, Sendrecv, Exchange, Allreduce, Reduce,
        Reduce_scatter, Allgather, Bcast, Barrier

hangs after 8K: Allgatherv, Gather, Gatherv, Scatter, Scatterv,
                Alltoall, Alltoallv

("hangs after 8K" means that the results for 8K are printed, but those for 16K are not - I've allowed them several hours after the 8K results have been printed before killing the jobs; the processes continue to use CPU time, but no progress seems to be made). I've only recently been able to run the IMB on such large number of nodes, some lower level issues prevented me from running them before.

The IMB finishes successfully under the same conditions when run on 64 nodes. With Allgatherv I've found that the breaking point is somewhere around 90 nodes: it works with 88 nodes and hangs with 90 nodes.

The above tests were performed with the default settings. When I specify '--mca mtl mx --mca pml cm', the IMB finishes successfully on 128 nodes; with MPICH-MX, the IMB also finishes successfully on 128 nodes. However, I consider it a serious problem if the default OpenMPI settings lead to hangs.
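
In other words, something along these lines finishes on 128 nodes (same placeholder hostfile as above), while the same command with the default settings (no --mca options) hangs in the collectives listed above:

    mpirun -np 128 --hostfile hosts --mca pml cm --mca mtl mx ./IMB-MPI1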

Is this a known (but undocumented) behaviour? Do other sites with a similar setup observe these hangs? Can someone suggest what to do to avoid them, or at least a way to debug this?
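
For what it's worth, the most I've thought of trying is to attach gdb to one of the stuck ranks and grab a backtrace, along these lines (assuming the benchmark binary is called IMB-MPI1):

    # on a node running one of the stuck ranks
    gdb -p <pid of IMB-MPI1>
    (gdb) bt
    (gdb) detach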

Thanks in advance!

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
