Hi Brian,
Have a look at
http://technical.bestgrid.org/index.php/Setup_GRAM5_on_CentOS_5#Increase_Open_Files_Limit
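That page covers raising the descriptor limit. The usual approach on CentOS 5 looks roughly like the sketch below (values and paths are illustrative; follow the wiki for the exact steps):

```shell
# Sketch of raising the open-files limit (illustrative values).
# 1. Per-user limit via PAM, in /etc/security/limits.conf:
#      seqware  soft  nofile  65536
#      seqware  hard  nofile  65536
# 2. The gatekeeper is launched by xinetd, which does not go through a
#    PAM login session, so limits.conf alone may not help; one common
#    workaround is adding "ulimit -n 65536" near the top of
#    /etc/init.d/xinetd and then running: service xinetd restart
ulimit -n    # verify the soft limit in a fresh shell afterwards
```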
Cheers,
Yuriy
On 23/01/12 18:45, Brian O'Connor wrote:
Hi,
I've been using GRAM for a long time now and I'd like to push it into
production, but I'm having issues with it. I submit workflows of
hundreds of jobs each day through an automated submitter, so I need to
be able to send jobs to a GRAM server and not have it get into a bad
state after x number of days. That's the goal at least...
Anyway, the latest problem I've had is with GRAM rejecting incoming
requests because of "Too many open files". Here's the error:
globus-job-run server.domain.name/jobmanager-sge /bin/hostname
GRAM Job submission failed because Error opening proxy file for
writing:
/u/seqware/.globus/job/sqwprod.hpc.oicr.on.ca/16217884770066032596.5836665131371726474/x509_user_proxy:
Too many open files (24) (error code 75)
I checked my proxy and it looks OK:
grid-proxy-info
subject  : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare/CN=1800547271
issuer   : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare
identity : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare
type     : RFC 3820 compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1373
timeleft : 479:16:03 (20.0 days)
I then looked at the number of open files for this user:
/usr/sbin/lsof | grep seqware | wc -l
2084
Looking at the globus-job-manager it's using up the majority:
ps aux | grep globus-job-man
seqware 175028 0.0 0.0 61200 768 pts/2 R+ 00:21 0:00 grep globus-job-man
seqware 4103600 0.1 0.4 116984 18628 ? S Jan22 1:26 globus-job-manager -conf /usr/local/globus/default/etc/globus-job-manager.conf -type sge
seqware 4103647 0.0 0.1 36548 7440 ? S Jan22 1:00 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware 4103649 0.0 0.1 36548 7456 ? S Jan22 0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware 4103650 0.0 0.1 36548 7440 ? S Jan22 0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware 4103651 0.0 0.1 36548 7444 ? S Jan22 0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
seqware 4103652 0.0 0.1 36544 7436 ? S Jan22 0:59 perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m sge -c interactive
/usr/sbin/lsof | grep seqware | grep 4103600 | wc -l
1069
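To break down what kinds of files the job manager is actually holding open, something like this works (PID taken from the ps output above):

```shell
# Summarize the job manager's open descriptors by lsof's TYPE column
# (4103600 is the globus-job-manager PID from the ps output above).
/usr/sbin/lsof -p 4103600 | awk '{print $5}' | sort | uniq -c | sort -rn
# Exact descriptor count for the process, straight from /proc:
ls /proc/4103600/fd | wc -l
```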
However, if I look at this user's limits it looks like they can open up
to 32768 files, and I can perform other file operations just fine.
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 69632
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 69632
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Does anyone know why this is happening? To date I've been killing the
globus-job-manager when things like this happen. Is there a guide
somewhere that describes the right way to reset the daemons if
something goes wrong? Is there a guide for avoiding common pitfalls
and setting up GRAM (in particular) to work in a heavily used grid
install? I want to be able to push thousands of jobs through the
system, but so far it seems to barf on me every few days, which has
caused a lot of disruption in our workflows.
I'm currently using 5.0.2; I would like to upgrade, but it requires the
IT group to authorize this. Here's my configuration for the gatekeeper:
service gsigatekeeper
{
socket_type = stream
wait = no
user = root
server = /usr/local/globus/default/sbin/globus-gatekeeper
server_args = -conf /usr/local/globus/default/etc/globus-gatekeeper.conf
env += LD_LIBRARY_PATH=/usr/local/globus/default/lib
env += GLOBUS_LOCATION=/usr/local/globus/default
env += GLOBUS_TCP_PORT_RANGE=40000,41000
env += GLOBUS_HOSTNAME=server.domain.name
env += SGE_QMASTER_PORT=6444
log_on_failure += USERID
nice = 0
instances = 100
max_load = 200.0
disable = no
}
Thanks for your help
--Brian O'Connor