Hi Joseph,

Thanks very much for your email.
We actually just had a failure (reboot) of our Globus box a couple of hours
ago, so this gets right at my question below about how to clean up after a
failure. Now that the machine has come back up, I see a ton of
globus-job-manager processes running as my "seqware" user (the one that
originally submitted the Globus jobs):

11517 seqware  16   0 30180 3532 1500 R  4.3  0.1   0:00.80 top
 3530 seqware  16   0 99372 3812 3068 S  0.6  0.1   0:02.62 globus-job-mana
 3532 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.51 globus-job-mana
 3553 seqware  16   0 99372 3816 3068 S  0.6  0.1   0:02.53 globus-job-mana
 3591 seqware  15   0 99372 3808 3068 S  0.6  0.1   0:02.54 globus-job-mana
 3716 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.39 globus-job-mana
 3739 seqware  15   0 99372 3812 3068 S  0.6  0.1   0:02.47 globus-job-mana

[seqware@sqwprod ~]$ ps aux | grep globus-job-manager | grep seqware | wc -l
1837

So there are 1837 of these daemons running, and I can no longer submit a
cluster job using:

globus-job-run sqwprod/jobmanager-sge /bin/hostname

It just hangs. I think this is because there is a lock file:

~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

My questions are:

1) What's the proper way to reset here: kill all the globus-job-managers,
remove the lock, and allow the job manager to respawn?

2) Why doesn't globus-job-manager (or the gatekeeper) look at
sge.4572dcea.pid and realize the previous globus-job-manager is dead?
Shouldn't it detect this, clean up its state, and launch a single
replacement?
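
In case it helps you see where my head is at, this is roughly the manual
reset I was about to try. I honestly don't know whether removing the lock
file by hand is safe, or whether it will break recovery of the jobs still
out on the cluster, so please correct me:

# kill all the leftover job managers owned by the submitting user
pkill -u seqware -f globus-job-manager

# remove what I assume is a stale lock left by the pre-reboot job manager
rm ~/.globus/job/sqwprod.hpc.oicr.on.ca/sge.4572dcea.lock

# see whether a fresh job manager starts and submissions work again
globus-job-run sqwprod/jobmanager-sge /bin/hostname

Is that roughly the right idea, or is there a supported command that does
this cleanup properly?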

Thanks for your help. I really appreciate it!

--Brian

On Mon, Feb 6, 2012 at 11:47 AM, Joseph Bester <[email protected]> wrote:
> On Feb 6, 2012, at 1:43 AM, Brian O'Connor wrote:
>> Hi,
>>
>> I'm in the process of stress testing a system for use in production.
>> The software stack is:
>>
>> pegasus workflows -> condor-G -> GRAM -> SGE
>>
>> I recently upgraded to GT 5.2 to attempt to address stability issues.
>> I was having a whole series of issues with the globus-job-manager
>> either becoming unresponsive, dying, opening too many file handles,
>> etc. Sorry in advance if I ask some stupid questions below; I'm
>> still trying to wrap my mind around the whole software stack and the
>> best way to set it up. I'm not submitting tens of thousands of jobs
>> here... something on the order of <500 max. So I feel that I should
>> be able to get this system rock-solid, yet it's been super flaky for
>> me, which is troubling.
>>
>> So this brings up my first question:
>>
>> 1) What do you do when the globus-job-manager dies? It seems like
>> this is a very critical daemon: the URLs I get back if I manually use
>> globus-job-submit no longer work, and I end up with random jobs
>> running on the cluster that I have no way to shut down. My question
>> here is how do I "clean up" after a globus-job-manager crash (or a
>> server reboot), and is there a way to trigger a new instance to start
>> up and watch over the jobs left running on the cluster? From the
>> calling client's perspective, how would they reconnect with these
>> lost jobs when the job URL is no longer valid? Has anyone had any
>> experience with the added complexity of Condor-G and how it deals
>> with a globus-job-manager failure?
>
> GRAM-wise, it's possible to get the new handle by submitting a restart
> job with the old job handle. This will work even if the old job manager
> is running, in which case it will return the old handle.
>
> If any operation that causes the job manager to get started (submit,
> restart, version check) detects that the job manager is not running, it
> will make sure one is started to monitor existing jobs (as long as it
> has a valid proxy). It will automatically clean up jobs if the client
> has done the two-phase commit end. For 5.2.1, we're working on having
> some expiration time for missed two-phase commits so that a job
> completely abandoned by a user will eventually get cleaned up.
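
(Interjecting here to make sure I understand the restart mechanics: after
today's reboot, would I get the handle back, and get a job manager
respawned to watch the leftover SGE jobs, with something like the
following? I'm guessing at the globusrun/RSL syntax and just reusing the
stale handle from one of my stuck jobs as an example:

globusrun -r sqwstage/jobmanager-sge '&(restart=https://sqwstage:59384/16217785859533749891/5764607537333819270/)'

If that's roughly right, I'll try it before killing anything by hand.)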
>
> So, Condor should be able to get a new handle to the job by submitting
> a restart job with the old job handle. I'm not too sure about the
> specifics of the Condor-G case, but it seems to do a lot of job manager
> restarts, so I'd be a little surprised if it isn't capable of getting
> the handle back. Maybe ask on a Condor list about what you are seeing.
>
> Of course, if the job manager is crashing, please report a problem to
> jira.globus.org so we can fix it.
>
>> My second question is related to SEG. I know SEG is the better way to
>> go. I'm currently using polling, though, since our sysadmin and I
>> weren't able to get SEG to work with SGE. When deployed, we saw it
>> parse through the SGE log and, once it hit the end, it started looking
>> for a reporting.0 file rather than just waiting for more data to be
>> written to the log file. There is no reporting.0 file, so the event
>> generator just sat there and kept looking for it rather than watching
>> the real log file. So, question (2): has anyone seen problems with the
>> SEG module for SGE looking for a non-existent reporting file and,
>> therefore, missing new job events? Is there any way to explicitly
>> tell SEG to read only one reporting file and not to try looking for
>> rotated log files?
>
> Based on a quick reading of the code, it looks like it should recover
> back to reading the main log file after processing any rotated one. One
> potential hangup with the SGE module is that it relies on SGE writing
> the reporting file, but not processing that file with the dbwriter
> process, as that alters the contents of the reporting file in a way
> that confuses the SEG module.
>
>> OK, since we're using polling (at least until we get SEG working),
>> that brings me to the third and final question. I submitted about 25
>> workflows via Condor DAGMan. At any one time there are between 1 and
>> 20 jobs running per workflow. The load on the host (which is our
>> Condor machine, GRAM server, and SGE submission host) is only slight.
>> However, when I do:
>>
>> condor_q -long 6731
>>
>> and look at a job that should have finished hours ago, I see the
>> Globus URL is:
>>
>> https://sqwstage:59384/16217785859533749891/5764607537333819270/
>>
>> So I check the status:
>>
>> globus-job-status https://sqwstage:59384/16217785859533749891/5764607537333819270/
>> ACTIVE
>>
>> So Globus (and Condor by extension) think the job is running. It
>> shouldn't be, though, since this is a fast-running job.
>>
>> I then look at every job running on the SGE cluster:
>>
>> for i in `qstat | grep seqware | awk '{print $1}'`; do qstat -j $i | grep '16217785859533749891.5764607537333819270'; done;
>>
>> And I get nothing... the job is not running on the cluster. I think
>> it's because it ran and finished but Globus didn't pick up on this. I
>> suppose it's possible the job never ran, but that's unlikely because I
>> think Globus would report PENDING.
>>
>> Anyway, this brings me to question (3): what do I do in this
>> situation, when Globus loses track of a job's state in SGE? I hate to
>> cancel the job and trigger Condor to re-run it. Is there a way to
>> request that the globus-job-manager check this job and, if it's not on
>> the cluster, signal that it's done? I can write code that monitors my
>> running Condor/Globus jobs and looks for those that aren't running on
>> the cluster but still respond as ACTIVE, but this just feels like a
>> total hack. These all sound like horrible options to me. So my
>> extended question is: why is this happening at all? I thought the
>> whole point of using Globus here is being able to submit jobs to a
>> scheduler and not having to worry about issues like this. Are my
>> expectations too high? Is it normal to have jobs get lost track of? Is
>> it expected that you have to monitor jobs across the various layers
>> and manually intervene?
>
> The poll method in the sge.pm module is run periodically for all jobs
> that GRAM knows about. If you set GRAM to log DEBUG messages, you might
> see some info from that script to let you know what it is doing. See
> http://www.globus.org/toolkit/docs/5.2/5.2.0/gram5/admin/#id2483565
> for dealing with the log options.
>
>> Thanks very much for any guidance, suggestions, hints, or comments you
>> can provide. Any tips you can give me that will help us get the
>> software stack stable will be very welcome!!
>>
>> --Brian O'Connor
>
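
P.S. In my original mail (quoted above) I mention that I could write a
watchdog that flags jobs GRAM still reports as ACTIVE but that SGE no
longer knows about. In case it clarifies what I mean (and why it feels
like such a hack), here's the rough shape of it. I'm guessing that the
GRAM contact is the last field of Condor's GridJobId attribute, and I'm
reusing the same qstat -j grep trick from above to match contacts to SGE
jobs:

#!/bin/bash
# Rough watchdog sketch: flag jobs that GRAM reports as ACTIVE but that
# don't appear anywhere in SGE.

# Guess: the GRAM contact URL is the last whitespace-separated field of
# Condor's GridJobId attribute.
condor_q -long | awk -F' = ' '/^GridJobId/ {gsub(/"/, "", $2); print $2}' | awk '{print $NF}' |
while read -r contact; do
    [ "$(globus-job-status "$contact")" = "ACTIVE" ] || continue

    # Guess: the two big numbers in the contact URL are what shows up in
    # the qstat -j output (the same string I grepped for above).
    tag=$(echo "$contact" | awk -F/ '{print $(NF-2) "." $(NF-1)}')

    found=no
    for i in `qstat | grep seqware | awk '{print $1}'`; do
        if qstat -j "$i" 2>/dev/null | grep -q "$tag"; then
            found=yes
            break
        fi
    done

    if [ "$found" = "no" ]; then
        echo "WARNING: $contact is ACTIVE according to GRAM but not found in SGE"
    fi
done

I'd much rather not run something like this in production, which is why
I'm hoping there's a proper fix on the GRAM/SEG side.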
