Hi.

I don't think it is necessary to specify the hosts via a hostfile when running 
OpenMPI under SGE; even $NSLOTS is not needed. Just run 
mpirun executable 
and this works very well.
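
For illustration, a minimal job script along those lines could look like the 
sketch below. It assumes your OpenMPI build picks up the SGE allocation on its 
own; the PE name and install paths are simply reused from your job file. (I use 
/bin/sh here, since the export lines in your script are Bourne-shell syntax, 
not csh.)

#!/bin/sh
#$ -N MPI_Job
#$ -pe mpi 4
# extend the search paths for the OFED OpenMPI build
PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/lib64
export PATH LD_LIBRARY_PATH
# no -np and no -hostfile: mpirun takes the slot count and the hosts
# granted by SGE for this job
mpirun /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1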

Regarding your memory problem:
I had similar problems when I requested the h_vmem resource in SGE. Without 
SGE everything worked, but starting the job through SGE gave exactly such 
memory errors. You can easily check how h_vmem is configured with 'qconf -sc'. 
If you have been using this option, try without it. The problem in my case was 
that OpenMPI sometimes allocates a lot of memory, the job immediately gets 
killed by SGE, and one gets error messages like these; see my posting from a 
few days ago. I am not sure whether this helps in your case, but it could be 
an explanation.
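
To check this, something along these lines should work (just a sketch; the 
exact complex configuration is site-specific, and 'jobscript.sh' is only a 
placeholder name):

# show the resource (complex) definitions and look at how h_vmem is set up
qconf -sc | grep h_vmem
# if h_vmem is requested by the job (or forced by the queue), SGE enforces it
# as a hard virtual-memory limit on the job's processes; try submitting
# without it, or with a much larger value, e.g.:
qsub -pe mpi 4 -l h_vmem=4G jobscript.sh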

Markus



On Thursday, 21 June 2007 at 15:26, sad...@gmx.net wrote:
> Hi,
>
> I'm running into a really strange error that is causing me some serious
> headaches. I want to integrate OpenMPI version 1.1.1 from the OFED package
> version 1.1 with SGE version 6.0. For mvapich everything works, but for
> OpenMPI it does not ;(. Here are my job file and the error message:
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -np $NSLOTS -hostfile
> $TMPDIR/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> ERRORMESSAGE:
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25768] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25769] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25770] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,102400)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [node04:25771] mca_mpool_openib_register: ibv_reg_mr(0x584000,528384)
> failed with error: Cannot allocate memory
> [0,1,1][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating
> low priority cq for mthca0 errno says Cannot allocate memory
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> MPI_Job.e111975 (END)
>
>
> If I run the OMPI job without SGE, everything works, e.g. with the
> following command:
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> If I do this with a static machine file, it works too:
> $ cat /tmp/machines
> node04
> node04
> node04
> node04
>
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -hostfile
> /tmp/machines /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> And if I run this in a job script, it works too, even with a static machine
> file (not shown below):
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> export PATH=$PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/ofed/mpi/gcc/openmpi-1.1.1.-1/lib64
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/bin/mpirun -v -np 4 -H
> node04,node04,node04,node04
> /usr/ofed/mpi/gcc/openmpi-1.1.1-1/tests/IMB-2.3/IMB-MPI1
>
> The ulimits are set correctly on all nodes in the cluster, e.g. on node04:
> -sh-3.00$ ssh node04 ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1024
> max locked memory       (kbytes, -l) 8162952
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> And the InfiniBand adapter seems to have no trouble at all:
> -sh-3.00$ ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200220ac8
>         System image GUID: 0x0002c90200220acb
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 18
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510a68
>                 Port GUID: 0x0002c90200220ac9
>
> Thanks for any suggestions..
>
