Hi Markus,

There are two MCA params that can help you, I believe:
1. You can set the maximum size of the shared memory file with

       -mca mpool_sm_max_size xxx

   where xxx is the maximum file size you want, expressed in bytes. The
   default value I see is 512 MBytes.

2. You can set the per-peer size of the file, again in bytes:

       -mca mpool_sm_per_peer_size xxx

   This will allocate a file of size xxx * num_procs_on_the_node on each
   node, up to the maximum file size (either the 512 MB default or
   whatever you specified using the previous param). This defaults to
   32 MBytes/proc.

I see that there is also a minimum (total, not per-proc) file size that
defaults to 128 MBytes. If that is still too large, you can adjust it with

       -mca mpool_sm_min_size yyy
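For example (untested, and the numbers here are purely illustrative -- pick
values that match what you actually want to allow per node), something like

    mpirun -np 16 \
        -mca mpool_sm_min_size      67108864  \
        -mca mpool_sm_per_peer_size 16777216  \
        -mca mpool_sm_max_size     268435456  \
        ./my_test_program

should cap the shared memory backing file at 256 MBytes total (16 MBytes
per process, 64 MBytes minimum) instead of the 512/32/128 MByte defaults.
You should also be able to see what your build actually uses with
"ompi_info --param mpool sm".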
Hope that helps,
Ralph


On 6/10/07 2:55 PM, "Markus Daene" <markus.da...@physik.uni-halle.de> wrote:

> Dear all,
>
> I hope I am in the correct mailing list with my problem.
> I am trying to run Open MPI with the gridengine (6.0u10, 6.1). For that I
> compiled Open MPI (1.2.2), which has the gridengine support included; I
> have checked it with ompi_info. In principle, Open MPI runs well.
> The gridengine is configured such that the user has to specify the memory
> consumption via the h_vmem option. I noticed that with a larger number of
> processes the job is killed by the gridengine for taking too much memory.
> To take a closer look at this, I wrote a small and simple (Fortran) MPI
> program which has just an MPI_Init and a (static) array, in my case of
> 50MB; the program then goes into an (infinite) loop, because it takes
> some time until the gridengine reports the maxvmem.
> I found that if the processes all run on different nodes, there is only
> an offset per process, i.e. the total scales linearly. But it becomes
> worse when the jobs run on one node. There the scaling seems to be
> quadratic in the number of processes, with a factor of about 30MB in my
> case. I made a list of the virtual memory reported by the gridengine,
> running on a 16-processor node:
>
>   #N proc   virt. mem [MB]
>       1          182
>       2          468
>       3          825
>       4         1065
>       5         1001
>       6         1378
>       7         1817
>       8         2303
>      12         4927
>      16         8559
>
> The pure program should need N*50MB; for 16 processes that is only 800MB,
> but it takes 10 times more, >7GB!!!
> Of course the gridengine will kill the job if this overhead is not taken
> into account, because of too much virtual memory consumption. The memory
> consumption is not related to the gridengine; it is the same if I run
> from the command line.
> I guess it might be related to the 'sm' component of the btl.
> Is it possible to avoid the quadratic scaling? Of course I could use the
> mvapi/tcp components only, like
>   mpirun --mca btl mvapi -np 16 ./my_test_program
> In this case the virtual memory is fine, but it is not what one wants on
> an SMP node.
>
> Then it becomes even worse:
> Open MPI nicely reports the (max./actual) used virtual memory to the
> gridengine as the sum over all processes. This value is then compared
> with the one the user has specified with the h_vmem option, but the
> gridengine takes this value per process for the allocation of the job
> (which works) and does not multiply it by the number of processes. Maybe
> one should report this to the gridengine mailing list, but it could be
> related to the Open MPI interface as well.
>
> The last thing I noticed:
> It seems that if the h_vmem option for gridengine jobs is specified like
> '2.0G', my test job was immediately killed; but when I specify '2000M'
> (which is obviously less) it works. The gridengine always puts the job on
> the correct node as requested, but I think there might be a problem in
> the Open MPI interface.
>
> It would be nice if someone could give some hints how to avoid the
> quadratic scaling, or maybe to think about whether this is really
> necessary in Open MPI.
>
> Thanks.
> Markus Daene
>
>
> my compiling options:
> ./configure --prefix=/not_important --enable-static
>   --with-f90-size=medium --with-f90-max-array-dim=7
>   --with-mpi-param-check=always --enable-cxx-exceptions --with-mvapi
>   --enable-mca-no-build=btl-tcp
>
> ompi_info output:
> Open MPI: 1.2.2
> Open MPI SVN revision: r14613
> Open RTE: 1.2.2
> Open RTE SVN revision: r14613
> OPAL: 1.2.2
> OPAL SVN revision: r14613
> Prefix: /usrurz/openmpi/1.2.2/pathscale_3.0
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: root
> Configured on: Mon Jun 4 16:04:38 CEST 2007
> Configure host: GE1N01
> Built by: root
> Built on: Mon Jun 4 16:09:37 CEST 2007
> Built host: GE1N01
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: pathcc
> C compiler absolute: /usrurz/pathscale/bin/pathcc
> C++ compiler: pathCC
> C++ compiler absolute: /usrurz/pathscale/bin/pathCC
> Fortran77 compiler: pathf90
> Fortran77 compiler abs: /usrurz/pathscale/bin/pathf90
> Fortran90 compiler: pathf90
> Fortran90 compiler abs: /usrurz/pathscale/bin/pathf90
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: yes
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: always
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: yes
> mpirun default --prefix: no
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.2)
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.2)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.2)
> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.2)
> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.2)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.2)
> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.2)
> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.2)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.2)
> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.2)
> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.2)
> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.2)
> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.2)
> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.2)
> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.2)
> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.2)
> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.2)
> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.2)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.2)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.2)
>
> ----------------------------------------------------------
> Markus Daene
> Martin Luther University Halle-Wittenberg
> Naturwissenschaftliche Fakultaet II
> Institute of Physics
> Von Seckendorff-Platz 1 (room 1.28)
> 06120 Halle
> Germany
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel