Two things:

1. You might want to update your version of Open MPI if possible; v1.1.1 is quite old. We have added many bug fixes and new features since v1.1.1 (including tight SGE integration). There is nothing special about the Open MPI that is included in the OFED distribution; you can download a newer version from the Open MPI web site (the current stable version is v1.2.3), then configure, compile, and install it against your current OFED installation. You should be able to configure Open MPI with:

        ./configure --with-openib=/usr/local/ofed ...

(assuming you chose the default location when installing OFED). You'll probably also want to specify a --prefix to install Open MPI to a specific location, etc.
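For example, a rough sketch of the whole sequence -- note that the /opt/openmpi-1.2.3 prefix below is just a placeholder I made up, and /usr/local/ofed assumes the default OFED location:

        # fetch and unpack the current stable release from the Open MPI web site
        tar xjf openmpi-1.2.3.tar.bz2
        cd openmpi-1.2.3

        # --with-openib points at your OFED tree; --prefix is wherever you
        # want this Open MPI installed (placeholder path; adjust to taste)
        ./configure --prefix=/opt/openmpi-1.2.3 --with-openib=/usr/local/ofed
        make all install

Once you're on a 1.2.x build, mpirun should also notice that it was launched under an SGE parallel environment and use the allocated slots directly, so you may not need to build a -hostfile from $TMPDIR/machines yourself.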

2. I know little/nothing about SGE, but I'm assuming that you need to have SGE pass the proper memory lock limits to new processes. In an interactive login, you showed that the max locked memory limit is "8162952" -- you might just want to make it unlimited, unless you have a reason for limiting it. See http://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage for details. Additionally, I *assume* that running under SGE will set different memory lock limits than interactive logins do (most resource managers do). You need to find out how to set the memory lock limits for jobs running under SGE; I'd suggest making the value "unlimited".
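As a rough illustration only (the file locations and the execd startup script name vary by installation, so treat these as assumptions): the interactive limit is usually raised via limits.conf, but batch jobs inherit their limits from the sge_execd daemon, so the daemon itself has to be restarted with a raised limit.

        # /etc/security/limits.conf -- affects PAM-mediated logins (e.g. ssh)
        *  soft  memlock  unlimited
        *  hard  memlock  unlimited

        # sge_execd is normally started at boot, outside of PAM, so your jobs
        # inherit whatever limit the daemon was started with.  One common
        # workaround is to add this near the top of the execd startup script
        # (script name/path is a guess -- adjust for your site) and restart it:
        ulimit -l unlimited

A quick way to see what your jobs actually get is to submit a one-liner and compare it against the interactive value, e.g.:

        # -S /bin/sh forces a Bourne shell, since "ulimit" is a sh built-in
        echo "ulimit -l" | qsub -S /bin/sh -j y -o memlock.out

If the value reported there is lower than what you see interactively, that is almost certainly where the registration failures are coming from.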



On Jun 21, 2007, at 9:26 AM, sad...@gmx.net wrote:

Hi,

I'm running into a really strange error that is causing me some serious headaches. I want to integrate Open MPI version 1.1.1 from the OFED 1.1 package with SGE version 6.0. Everything works with MVAPICH, but not with Open MPI ;(.
Here is my jobfile and error message:
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

ERRORMESSAGE:
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
(output from MPI_Job.e111975)


If I run the Open MPI job just without SGE, everything works, e.g. with the
following command:
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

If I do this with a static machine file, it works too:
$ cat /tmp/machines
node04
node04
node04
node04

/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

And if I run this in a job script under SGE, it also works, even with a static
machine file (that variant is not shown below):
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1

The ulimits are set correctly on all nodes in the cluster, e.g. node04:
-sh-3.00$ ssh node04 ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 8162952
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And InfiniBand seems to have no trouble at all:
-sh-3.00$ ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200220ac8
        System image GUID: 0x0002c90200220acb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200220ac9

Thanks for any suggestions.



--
Jeff Squyres
Cisco Systems
