Hi,

I'm running into a really strange error that is causing me some serious headaches. I want to integrate OpenMPI version 1.1.1 from the OFED 1.1 package with SGE version 6.0. With mvapich everything works, but not with OpenMPI ;(. Here is my jobfile (the error message follows below); note the script uses a csh shebang, so the environment is set with setenv:

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
setenv PATH ${PATH}:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
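In case the PE setup matters: the "mpi" parallel environment follows the stock SGE mpi template, whose startmpi.sh is what writes the $TMPDIR/machines file used above. Roughly like this (a sketch from memory rather than a verbatim qconf dump, and the SGE root path is illustrative, so exact values may differ):

$ qconf -sp mpi
pe_name           mpi
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /usr/sge/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE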
And this is the error message:

[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[...the same four lines repeat for pids 25769, 25770 and 25771, snipped for brevity...]
[0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
MPI_Job.e111975 (END)
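Since ibv_reg_mr failing with "Cannot allocate memory" usually points at the locked-memory limit, one thing I could still try is dumping the limits the job actually sees when SGE spawns it (a minimal sketch using /bin/sh so the ulimit builtin is available; the job name is made up, everything else matches my setup):

#!/bin/sh
#$ -N Limit_Check
#$ -pe mpi 4
# Compare these against the interactive values from 'ssh node04 ulimit -a' shown below
ulimit -a
echo "NSLOTS=$NSLOTS"
cat $TMPDIR/machines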
If I run the OMPI job just without SGE, everything works, e.g. the following command:

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If I do this with a static machinefile, it works too:

$ cat /tmp/machines
node04
node04
node04
node04

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

And if I run this in a jobscript, it works as well, even in the variant with a static machinefile (not shown below):

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
setenv PATH ${PATH}:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

The ulimits are correct on all nodes in the cluster, e.g. on node04:

-sh-3.00$ ssh node04 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 8162952
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And the InfiniBand fabric seems to have no trouble at all:

-sh-3.00$ ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200220ac8
        System image GUID: 0x0002c90200220acb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200220ac9

Thanks for any suggestions.
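P.S. One more test that might narrow it down: forcing OpenMPI to skip InfiniBand and use TCP inside the same SGE jobfile (a sketch; --mca btl self,tcp only restricts the byte-transfer layers, everything else stays as in the jobfile above):

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun --mca btl self,tcp -np $NSLOTS -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If that runs fine under SGE, the failure would be isolated to the openib memory registration in the SGE-spawned environment.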