Sorry for the late reply, but I haven't had access to the machine over the weekend.
> I don't really know what this means. People have explained "loose"
> vs. "tight" integration to me before, but since I'm not an SGE user,
> the definitions always fall away.

I *assume* loosely coupled jobs are just jobs: SGE finds some nodes to process them, and from then on it doesn't care about anything in connection with them. In contrast, for tightly coupled jobs SGE keeps track of sub-processes that may spawn from the job, terminates them too in case of a failure, and takes care of specified resources.

> Based on your prior e-mail, it looks like you are always invoking
> "ulimit" via "pdsh", even under SGE jobs. This is incorrect.

Why?

> Can't you just submit an SGE job script that runs "ulimit"?

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
hostname && ulimit -a

ATM I'm quite confused: I want to use the C shell, but ulimit is a bash thing; the C shell uses "limit" instead. Hmm... and SGE obviously uses bash, despite my request for csh in the first line. But if I just use #!/bin/bash I get the same limits:

-sh-3.00$ cat MPI_Job.o112116
node02
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Oops => 32 kbytes... So this isn't OMPI's fault.

>>> What are the limits of the user that launches the SGE daemons? I.e.,
>>> did the SGE daemons get started with proper "unlimited" limits? If
>>> not, that could hamper SGE's ability to set the limits that you told
>> The limits in /etc/security/limits.conf apply to all users (using a
>> '*'), hence the SGE processes and daemons shouldn't have any limits.
>
> Not really.
> limits.conf is not universally applied; it's a PAM entity. So for
> daemons that start via /etc/init.d scripts (or whatever the
> equivalent is on your system), PAM limits are not necessarily
> applied. For example, I had to manually insert a "ulimit -Hl
> unlimited" in the startup script for my SLURM daemons.

Hmm, ATM there are some important jobs in the queue (started some MONTHS ago), so I cannot restart the daemon. Is there any other way than a restart (with proper limits) to ensure the limits of a process?

Thanks for your great help :)
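For what it's worth, here is a small sanity check I would try next (just a sketch, assuming SGE really does run the script with sh/bash as it did above): compare the soft and the hard locked-memory limit inside a job. If the hard limit is still unlimited, the soft limit can be raised in the job script itself, without restarting the daemon; if the hard limit is also 32 kB, then the cap was inherited from the daemon and only a restart with proper limits (or the init-script ulimit trick quoted above) can change it for new jobs.

```shell
#!/bin/bash
#$ -N limit_check
# Print both the soft and the hard locked-memory limit. Only the soft
# limit can be raised by an unprivileged job, and only up to the hard
# limit that the SGE daemon inherited when it was started.
echo "soft: $(ulimit -S -l)"
echo "hard: $(ulimit -H -l)"
# Raising the soft limit to the hard limit is always permitted.
ulimit -S -l "$(ulimit -H -l)" && echo "soft raised to: $(ulimit -S -l)"
```

Submitting this with qsub and looking at the job's output file should show which of the two cases applies.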