Yo folks

We are trying to run some tests on a new cluster and are having trouble telling hardware, system software, and OMPI failures apart. This is a 16-ppn Opteron system running SLURM under RHEL (I forget the precise version), with IB and OMPI 1.2.6.

Everything launches just fine and seems to work okay. However, on large jobs (e.g., >450 procs), the IMB tests fail and crash a bunch of the nodes on which they are running.

Has anyone else been able to test in 16+ ppn configurations? I'm wondering if we have an SM (shared memory) problem, perhaps inadequate backing file space or something?
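
For the backing-file guess, here is a minimal sketch of a free-space check that could be run on a node. It assumes the shared-memory backing file lives on the same filesystem as the system temp directory, which may not match the actual session directory on this cluster. The 512 MiB threshold and the free_space_report helper are made up for illustration and are not OMPI values.

```python
import shutil
import tempfile


def free_space_report(path, required_bytes):
    """Return (free_bytes, enough) for the filesystem that holds `path`.

    `required_bytes` is a caller-chosen threshold, not a value taken from OMPI.
    """
    usage = shutil.disk_usage(path)
    return usage.free, usage.free >= required_bytes


if __name__ == "__main__":
    # Assumption: the backing file sits under the system temp directory.
    session_dir = tempfile.gettempdir()
    # Hypothetical threshold; pick whatever the job is expected to need.
    required = 512 * 1024 * 1024
    free, enough = free_space_report(session_dir, required)
    status = "at or above" if enough else "below"
    print(f"{session_dir}: {free / 2**20:.0f} MiB free, "
          f"{status} the {required / 2**20:.0f} MiB threshold")
```

Running it on a node that crashed and on one that did not would at least show whether free space differs between them.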

Any suggestions on how to debug this, or on config options for higher-ppn systems, would be appreciated. We don't see this problem on anything with lower ppn. I'm going to give it a try with 1.3 and see what happens there.

Thanks
Ralph
