Ralph,

In our quest for better shared-memory collectives, we did some runs on 16-core Intel machines. The SM worked as expected, as far as I can tell. Unfortunately, we have only one such node, so we never tried more than 16 processes.
  george.

On Jul 24, 2008, at 11:13 PM, Ralph Castain wrote:
Yo folks

We are trying to run some tests on a new cluster and are having a problem telling hardware, system software, and OMPI failures apart. This is a 16-ppn Opteron system running SLURM under RHEL (forget the precise version), with IB and OMPI 1.2.6.

Everything launches just fine and seems to work okay. However, on large jobs (e.g., >450 procs), the IMB tests fail and crash a bunch of the nodes on which they are running.

Has anyone else been able to test in 16+ ppn configurations? I'm wondering if we have an SM problem - perhaps inadequate backing file space or something?

Any suggestions on how to debug this or config options for higher ppn systems would be appreciated. We don't see this problem on anything with lesser ppn. I'm going to give it a try with 1.3 and see what happens there.

Thanks
Ralph

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel