Hi Brian,

Have a look at
http://technical.bestgrid.org/index.php/Setup_GRAM5_on_CentOS_5#Increase_Open_Files_Limit
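
Resource limits are per process and inherited from the parent, so the
32768 that "ulimit -a" reports in your login shell is not necessarily
what the running globus-job-manager got when it was started. Note also
that "lsof | grep" counts memory-mapped libraries and similar entries,
not only file descriptors. A minimal sketch for checking the running
process directly, assuming a Linux kernel that exposes
/proc/<pid>/limits and /proc/<pid>/fd (the function names are made up
for illustration, and you need to run it as the process owner or root):

```shell
# Print the soft "Max open files" limit of a running process.
# Assumes /proc/<pid>/limits exists (Linux only).
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Count the file descriptors the process currently holds open.
# Unlike "lsof | grep", this ignores mapped libraries, cwd, etc.
open_fd_count() {
    ls "/proc/$1/fd" | wc -l
}
```

For the job manager in your ps listing that would be
"soft_nofile_limit 4103600" and "open_fd_count 4103600". If the second
number is close to the first, the process limit is what you are hitting.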

Cheers,
Yuriy

On 23/01/12 18:45, Brian O'Connor wrote:
Hi,

I've been using GRAM for a long time now and I'd like to push it into
production but I'm having issues with it.  I submit workflows of
hundreds of jobs each day through an automated submitter so I need to
be able to send jobs to a GRAM server and not have it get in a bad
state after x number of days.  That's the goal at least...

Anyway, the latest problem I've had is with GRAM rejecting incoming
requests with a "Too many open files" error.

Here's the error:

globus-job-run server.domain.name/jobmanager-sge /bin/hostname

GRAM Job submission failed because Error opening proxy file for writing: /u/seqware/.globus/job/sqwprod.hpc.oicr.on.ca/16217884770066032596.5836665131371726474/x509_user_proxy: Too many open files (24) (error code 75)

I checked my proxy and it looks OK:

grid-proxy-info
subject  : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=Seq Ware/CN=1800547271
issuer   : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=Seq Ware
identity : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=Seq Ware
type     : RFC 3820 compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1373
timeleft : 479:16:03  (20.0 days)

I then looked at the number of open files for this user:

  /usr/sbin/lsof  | grep seqware | wc -l
2084

Looking at the globus-job-manager it's using up the majority:

ps aux | grep globus-job-man
seqware   175028  0.0  0.0  61200   768 pts/2    R+   00:21   0:00 grep globus-job-man
seqware  4103600  0.1  0.4 116984 18628 ?        S    Jan22   1:26 globus-job-manager -conf /usr/local/globus/default/etc/globus-job-manager.conf -type sge
seqware  4103647  0.0  0.1  36548  7440 ?        S    Jan22   1:00 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware  4103649  0.0  0.1  36548  7456 ?        S    Jan22   0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware  4103650  0.0  0.1  36548  7440 ?        S    Jan22   0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware  4103651  0.0  0.1  36548  7444 ?        S    Jan22   0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware  4103652  0.0  0.1  36544  7436 ?        S    Jan22   0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive

/usr/sbin/lsof  | grep seqware | grep 4103600 | wc -l
1069

However, if I look at this user's limits, it looks like they can open up
to 32768 files, and I can perform other file operations just fine.

  ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 69632
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 69632
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Does anyone know why this is happening?  To date I've been killing the
globus-job-manager when things like this happen.  Is there a guide
somewhere that describes the right way to reset the daemons if
something goes wrong?  Is there a guide for avoiding common pitfalls
and setting up GRAM (in particular) to work in a heavily used grid
install?  I want to be able to push thousands of jobs through the
system but so far it seems to barf on me every few days which has
caused a lot of disruption in our workflows.

I'm currently using 5.0.2. I would like to upgrade, but that requires
authorization from the IT group.  Here's my configuration for the
gatekeeper:

service gsigatekeeper
{
     socket_type      = stream
     wait             = no
     user             = root
     server           = /usr/local/globus/default/sbin/globus-gatekeeper
     server_args      = -conf /usr/local/globus/default/etc/globus-gatekeeper.conf
     env             += LD_LIBRARY_PATH=/usr/local/globus/default/lib
     env             += GLOBUS_LOCATION=/usr/local/globus/default
     env             += GLOBUS_TCP_PORT_RANGE=40000,41000
     env             += GLOBUS_HOSTNAME=server.domain.name
     env             += SGE_QMASTER_PORT=6444
     log_on_failure  += USERID
     nice             = 0
     instances        = 100
     max_load         = 200.0
     disable          = no
}


Thanks for your help

--Brian O'Connor
