Two things:
1. You might want to update your version of Open MPI if possible; v1.1.1
is quite old. We have added many bug fixes and features since v1.1.1
(including tight SGE integration). There is nothing special about the
Open MPI that is included in the OFED distribution; you can download a
newer version from the Open MPI web site (the current stable version is
v1.2.3) and configure, compile, and install it against your current OFED
installation. You should be able to configure Open MPI with:
./configure --with-openib=/usr/local/ofed ...
(assuming you chose the default location when installing OFED). You'll
probably also want to specify a --prefix to install Open MPI to a
specific location, etc.; see the build sketch after point 2.
2. I know little/nothing about SGE, but I'm assuming that you need to
have SGE pass the proper memory lock limits to new processes. In an
interactive login, you showed that the max locked memory limit is
8162952 KB -- you might just want to make it unlimited, unless you have
a reason for limiting it. See
http://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
for details. Additionally, I *assume* that running under SGE will set
different memory locked limits than interactive logins do (most resource
managers do this). You need to find out how to set the memory locked
limits for jobs running under SGE; I'd suggest making the value
"unlimited". A sketch for checking and raising the limit under SGE
follows below.
On Jun 21, 2007, at 9:26 AM, sad...@gmx.net wrote:
Hi,
I'm having a really strange error that is causing me some serious
headaches.
I want to integrate Open MPI version 1.1.1 from the OFED 1.1 package
with SGE version 6.0. Everything works with MVAPICH, but not with
Open MPI ;(.
Here is my jobfile and error message:
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
ERROR MESSAGE:
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384) failed with error: Cannot allocate memory
[0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
MPI_Job.e111975
If I run the Open MPI job without SGE, everything works, e.g. with the
following command:
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
If I do this with static machinefiles, it works too:
$ cat /tmp/machines
node04
node04
node04
node04
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
And if I run this from a jobscript, it works too, even with a static
machinefile (not shown below):
#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H node04,node04,node04,node04 /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
The ulimits are set correctly on all nodes in the cluster, e.g. on node04:
-sh-3.00$ ssh node04 ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 8162952
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 139264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
And InfiniBand seems to have no trouble at all:
-sh-3.00$ ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200220ac8
        System image GUID: 0x0002c90200220acb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 18
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200220ac9
Thanks for any suggestions.
--
Jeff Squyres
Cisco Systems