Sorry for the late reply, but I haven't had access to the machine over the weekend.
> I don't really know what this means. People have explained "loose"
> vs. "tight" integration to me before, but since I'm not an SGE user,
> the definitions always fall away.

I *assume* loosely coupled jobs are just jobs: SGE finds some nodes to process them, and from then on it doesn't care about anything in connection with them. In contrast, for tightly coupled jobs SGE keeps track of sub-processes that may spawn from the job, terminates them too in case of a failure, and takes care of specified resources.

> Based on your prior e-mail, it looks like you are always invoking
> "ulimit" via "pdsh", even under SGE jobs. This is incorrect.

Why?

> Can't you just submit an SGE job script that runs "ulimit"?

#!/bin/csh -f
#$ -N MPI_Job
#$ -pe mpi 4
hostname && ulimit -a

ATM I'm quite confused: I want to use the C shell, but ulimit is a bash thing; the C shell uses "limit" instead. Hmm... and SGE obviously uses bash, despite my request for csh in the first line. But if I just use #!/bin/bash I get the same limits:

-sh-3.00$ cat MPI_Job.o112116
node02
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Oops => 32 kbytes... So this isn't OMPI's fault.

>>> What are the limits of the user that launches the SGE daemons? I.e.,
>>> did the SGE daemons get started with proper "unlimited" limits? If
>>> not, that could hamper SGE's ability to set the limits that you told
>> The limits in /etc/security/limits.conf apply to all users (using a
>> '*'), hence the SGE processes and daemons shouldn't have any limits.
>
> Not really.
> limits.conf is not universally applied; it's a PAM entity. So for
> daemons that start via /etc/init.d scripts (or whatever the
> equivalent is on your system), PAM limits are not necessarily
> applied. For example, I had to manually insert a "ulimit -Hl
> unlimited" in the startup script for my SLURM daemons.

Hmm, ATM there are some important jobs in the queue (started some MONTHS ago), so I cannot restart the daemon. Is there any other way than a restart (with proper limits) to ensure the limits of a process?

Thanks for your great help :)
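For what it's worth, here is a small sanity check I would try next (just a sketch, assuming SGE really does run the script with sh/bash as it did above): compare the soft and the hard locked-memory limit inside a job. If the hard limit is still unlimited, the soft limit can be raised in the job script itself, without restarting the daemon; if the hard limit is also 32 kB, then the cap was inherited from the daemon and only a restart with proper limits (or the init-script ulimit trick quoted above) can change it for new jobs.

```shell
#!/bin/bash
#$ -N limit_check
# Print both the soft and the hard locked-memory limit. Only the soft
# limit can be raised by an unprivileged job, and only up to the hard
# limit that the SGE daemon inherited when it was started.
echo "soft: $(ulimit -S -l)"
echo "hard: $(ulimit -H -l)"
# Raising the soft limit to the hard limit is always permitted.
ulimit -S -l "$(ulimit -H -l)" && echo "soft raised to: $(ulimit -S -l)"
```

Submitting this with qsub and looking at the job's output file should show which of the two cases applies.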