Hi,

I'm running into a really strange error that is causing me some serious headaches.
I want to integrate Open MPI 1.1.1 from the OFED 1.1 package with SGE 6.0.
MVAPICH works fine, but Open MPI does not ;(.
Here are my job file and the error message:
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
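
For context: $TMPDIR/machines is the file the mpi PE's start script generates from $PE_HOSTFILE. A minimal sketch of that step (assuming a stock startmpi.sh-style PE; the actual start_proc_args here may differ) would be:

#!/bin/sh
# sketch only: expand $PE_HOSTFILE ("host slots queue ..." per line)
# into a one-host-per-slot machines file for mpirun
while read host nslots rest; do
  i=0
  while [ "$i" -lt "$nslots" ]; do
    echo "$host"
    i=`expr $i + 1`
  done
done < "$PE_HOSTFILE" > "$TMPDIR/machines"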

ERROR MESSAGE:
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
(contents of MPI_Job.e111975)


If I run the Open MPI job without SGE, everything works, e.g. with the following command:
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If I do this with a static machine file, it works too:
$ cat /tmp/machines
node04
node04
node04
node04

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

And if I run this in a job script, it works too, even with a static machine file
(that variant is not shown below):
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
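
To narrow down the difference between the failing and the working runs, a small check I could add to the failing job script (plain sh, nothing beyond standard SGE variables) is to dump what the PE actually hands the job and diff it against the static /tmp/machines:

# sketch: print the hostfiles the PE provides inside the failing job
echo "PE_HOSTFILE = $PE_HOSTFILE"
cat "$PE_HOSTFILE"
echo "--- $TMPDIR/machines ---"
cat "$TMPDIR/machines"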

The ulimits are correct on all nodes in the cluster, e.g. on node04:
-sh-3.00$ ssh node04 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 8162952
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
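
Those are the limits I see over ssh, though. Since the SGE job is started by sge_execd rather than by a login shell, the limits inside the job might differ; a minimal test job to check that (just the standard ulimit builtin, nothing else assumed) would be:

#!/bin/sh
#$ -N limit_check
#$ -pe mpi 4
# sketch: show the limits the job actually runs with under sge_execd,
# in particular the max locked memory that ibv_reg_mr pinning depends on
ulimit -a
ulimit -l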

And the InfiniBand fabric seems to have no problems at all:
-sh-3.00$ ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200220ac8
        System image GUID: 0x0002c90200220acb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200220ac9

Thanks for any suggestions.
