Hi. I think it is not necessary to specify the hosts via a hostfile when using SGE with OpenMPI; even $NSLOTS is not needed. Just running 'mpirun executable' works very well.
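For illustration, a job script along those lines might look like the sketch below. It is only a sketch: the PE name 'mpi' and the install paths are copied from the job file quoted further down, it is written for /bin/sh so that the 'export' lines are valid, and it assumes your Open MPI build picks up the slot count and host list from the SGE environment on its own:

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
# Paths copied from the quoted job file; adjust to your installation.
PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
export LD_LIBRARY_PATH
# No -np and no -hostfile: the idea is that SGE itself provides slots and hosts.
mpirun /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1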
Regarding your memory problem: I had similar problems when I requested the h_vmem resource in SGE. Everything worked without SGE, but starting the job through SGE produced exactly these memory errors. You can check how h_vmem is configured with 'qconf -sc'. If you have requested it, try submitting without it (I sketched the corresponding commands at the very end of this message, below the quote). In my case the problem was that OpenMPI sometimes allocates a lot of memory, the job got immediately killed by SGE, and error messages like these were the result; see my posting from a few days ago. I am not sure whether this helps in your case, but it could be an explanation.

Markus

On Thursday, 21 June 2007, 15:26, sad...@gmx.net wrote:
> Hi,
>
> I'm having some really strange error causing me some serious headaches.
> I want to integrate OpenMPI version 1.1.1 from the OFED package version
> 1.1 with SGE version 6.0. For mvapich all works, but for OpenMPI not ;(.
> Here is my jobfile and error message:
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile
> $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> ERRORMESSAGE:
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating
> low priority cq for mthca0 errno says Cannot allocate memory
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> MPI_Job.e111975 (END)
>
>
> If I run the OMPI job just without SGE => everything works e.g. the
> following command:
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> If I do this with static machinefiles, it works too:
> $ cat /tmp/machines
> node04
> node04
> node04
> node04
>
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile
> /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> And if I run this in a jobscript it works even with a static machinefile
> (not shown below):
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> There are correct ulimits for all nodes in the cluster e.g. node04:
> -sh-3.00$ ssh node04 ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 8162952
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 139264
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> And the infiniband seems to have no troubles at all:
> -sh-3.00$ ibstat
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.0.800
> Hardware version: a0
> Node GUID: 0x0002c90200220ac8
> System image GUID: 0x0002c90200220acb
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 10
> Base lid: 18
> LMC: 0
> SM lid: 1
> Capability mask: 0x02510a68
> Port GUID: 0x0002c90200220ac9
>
> Thanks for any suggestions..
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
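To make the h_vmem check from my reply above concrete, here is roughly what I mean. These are standard SGE commands; 'myjob.sh' and the 4G value are just placeholders, and the job id is taken from the quoted error file name MPI_Job.e111975:

# Is h_vmem defined as a complex (and marked consumable) on this cluster?
qconf -sc | grep h_vmem

# For a job that is still pending or running, check whether it requested h_vmem:
qstat -j 111975 | grep h_vmem

# Try resubmitting without the limit, or with a deliberately generous one,
# and see whether the ibv_reg_mr errors go away.
qsub myjob.sh
qsub -l h_vmem=4G myjob.sh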