Hi Brian,

I think I'd try to convince the IT group to authorize the upgrade to GT 5.2.
According to the 5.2.0 GRAM5 release notes
(http://www.globus.org/toolkit/docs/5.2/5.2.0/gram5/rn/#gram5-fixed), the issue with accumulating open files (http://jira.globus.org/browse/GRAM-223) was fixed in the 5.2 series. We had the same problem with 5.0.4, and 5.2 works fine for us. Increasing the limits will certainly help but, depending on user
activity, may just delay the problem.

Martin

On 23/01/12 6:47 PM, Yuriy Halytskyy wrote:
Hi Brian,

Have a look at
http://technical.bestgrid.org/index.php/Setup_GRAM5_on_CentOS_5#Increase_Open_Files_Limit
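
In case that page is unreachable: the usual approach on CentOS 5 is to raise the per-user nofile limit in /etc/security/limits.conf. The values below are illustrative, not taken from the wiki page. Note that PAM limits apply to login sessions; a daemon launched from xinetd inherits xinetd's own limits, so xinetd itself may also need to be restarted with a raised limit.

```
# /etc/security/limits.conf -- illustrative values, not from the wiki page
seqware  soft  nofile  65536
seqware  hard  nofile  65536
```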

Cheers,
Yuriy

On 23/01/12 18:45, Brian O'Connor wrote:
Hi,

I've been using GRAM for a long time now and I'd like to push it into
production but I'm having issues with it.  I submit workflows of
hundreds of jobs each day through an automated submitter so I need to
be able to send jobs to a GRAM server and not have it get in a bad
state after x number of days.  That's the goal at least...

Anyway, the latest problem I've had is with GRAM rejecting incoming
requests because of "Too many open files".

Here's the error:

globus-job-run server.domain.name/jobmanager-sge /bin/hostname

GRAM Job submission failed because Error opening proxy file for
writing: /u/seqware/.globus/job/sqwprod.hpc.oicr.on.ca/16217884770066032596.5836665131371726474/x509_user_proxy:
Too many open files (24) (error code 75)

I checked my proxy and it looks OK:

grid-proxy-info
subject  : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare/CN=1800547271
issuer   : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare
identity : /O=Grid/OU=GlobusTest/OU=simpleCA-sqwstage.hpc.oicr.on.ca/OU=hpc.oicr.on.ca/CN=SeqWare
type     : RFC 3820 compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1373
timeleft : 479:16:03  (20.0 days)

I then looked at the number of open files for this user:

  /usr/sbin/lsof  | grep seqware | wc -l
2084

Looking at the globus-job-manager, it's holding the majority of them:

ps aux | grep globus-job-man
seqware   175028  0.0  0.0  61200   768 pts/2    R+   00:21   0:00
grep globus-job-man
seqware  4103600  0.1  0.4 116984 18628 ?        S    Jan22   1:26
globus-job-manager -conf
/usr/local/globus/default/etc/globus-job-manager.conf -type sge
seqware  4103647  0.0  0.1  36548  7440 ?        S    Jan22   1:00
perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m
sge -c interactive
seqware  4103649  0.0  0.1  36548  7456 ?        S    Jan22   0:59
perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m
sge -c interactive
seqware  4103650  0.0  0.1  36548  7440 ?        S    Jan22   0:59
perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m
sge -c interactive
seqware  4103651  0.0  0.1  36548  7444 ?        S    Jan22   0:59
perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m
sge -c interactive
seqware  4103652  0.0  0.1  36544  7436 ?        S    Jan22   0:59
perl /usr/local/globus/5.0.2/libexec/globus-job-manager-script.pl -m
sge -c interactive

/usr/sbin/lsof  | grep seqware | grep 4103600 | wc -l
1069
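
(A note on counting: "lsof | grep" tends to overcount, since it matches threads and any output line containing the string. Each entry under /proc/<pid>/fd is exactly one open descriptor, so counting those gives a precise per-process figure. Demonstrated on the current shell below; substitute the job-manager PID from the ps output:)

```shell
# Each entry in /proc/<pid>/fd is one open file descriptor, so counting them
# gives an exact per-process total. $$ (this shell) is used only as a demo;
# substitute the globus-job-manager PID, e.g. 4103600 from the ps output above.
pid=$$
ls /proc/"$pid"/fd | wc -l
```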

However, if I look at this user's limits, it looks like they can open up
to 32768 files, and I can perform other file operations just fine.

  ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 69632
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 69632
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
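
One caveat about "ulimit -a": it reports the login shell's limits, which PAM applies at login. A daemon spawned by xinetd inherits xinetd's rlimits instead, and those may be much lower. On kernels 2.6.24 and later, the limits actually in force for a running process can be read from /proc (shown here against the current shell; substitute the job-manager PID):

```shell
# /proc/<pid>/limits lists the rlimits in force for that process; for an
# xinetd-spawned daemon this can differ from your login shell's "ulimit -n".
pid=$$                  # substitute the globus-job-manager PID, e.g. 4103600
grep 'open files' /proc/"$pid"/limits
```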

Does anyone know why this is happening?  To date I've been killing the
globus-job-manager when things like this happen.  Is there a guide
somewhere that describes the right way to reset the daemons if
something goes wrong?  Is there a guide for avoiding common pitfalls
and setting up GRAM (in particular) to work in a heavily used grid
install?  I want to be able to push thousands of jobs through the
system, but so far it seems to barf on me every few days, which has
caused a lot of disruption in our workflows.

I'm currently using 5.0.2; I'd like to upgrade, but that requires
authorization from the IT group.  Here's my configuration for the
gatekeeper:

service gsigatekeeper
{
     socket_type      = stream
     wait             = no
     user             = root
     server           = /usr/local/globus/default/sbin/globus-gatekeeper
     server_args      = -conf
/usr/local/globus/default/etc/globus-gatekeeper.conf
     env             += LD_LIBRARY_PATH=/usr/local/globus/default/lib
     env             += GLOBUS_LOCATION=/usr/local/globus/default
     env             += GLOBUS_TCP_PORT_RANGE=40000,41000
     env             += GLOBUS_HOSTNAME=server.domain.name
     env             += SGE_QMASTER_PORT=6444
     log_on_failure  += USERID
     nice             = 0
     instances        = 100
     max_load         = 200.0
     disable          = no
}
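
On resetting the daemons after a wedge, a rough sketch (the user name, process names, and service layout are taken from the messages above; the exact steps will vary with your install):

```shell
# Dry-run first: list the matching job-manager processes before killing anything.
pgrep -u seqware -fl globus-job-manager || echo "no job managers found"
# Then, to actually reset:
#   pkill -u seqware -f globus-job-manager   # stop the stuck job managers
#   service xinetd reload                    # re-read the gsigatekeeper service
```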


Thanks for your help

--Brian O'Connor

