Hi, I'm in the process of stress testing a system for use in production. The software stack is:
pegasus workflows -> condor-G -> GRAM -> SGE

I recently upgraded to GT 5.2 to try to address stability issues. I was having a whole series of problems with the globus-job-manager: becoming unresponsive, dying, opening too many file handles, etc. Sorry in advance if I ask some stupid questions below; I'm still trying to wrap my mind around the whole software stack and the best way to set it up. I'm not submitting tens of thousands of jobs here, something on the order of <500 max, so I feel I should be able to get this system rock solid, yet it's been very flaky for me, which is troubling.

This brings me to my first question:

1) What do you do when the globus-job-manager dies? It seems like a very critical daemon: the URLs I get back if I manually use globus-job-submit no longer work, and I end up with random jobs running on the cluster that I have no way to shut down. How do I "clean up" after a globus-job-manager crash (or a server reboot), and is there a way to trigger a new instance to start up and watch over the jobs left running on the cluster? From the calling client's perspective, how would it reconnect with these lost jobs when the job URL is no longer valid? Has anyone had experience with the added complexity of Condor-G and how it deals with a globus-job-manager failure?

My second question is related to SEG. I know SEG is the better way to go, but I'm currently using polling since our sysadmin and I weren't able to get SEG to work with SGE. When deployed, we saw it parse through the SGE log and, once it hit the end, it started looking for a reporting.0 file rather than just waiting for more data to be written to the log file. There is no reporting.0 file, so the event generator just sat there looking for it rather than watching the real log file.

2) Has anyone seen the SEG module for SGE look for a non-existent rotated reporting file and, as a result, miss new job events? Is there a way to explicitly tell SEG to read only one reporting file and not to go looking for rotated log files?

Since I'm using polling (at least until we get SEG working), that brings me to my third and final question. I submitted about 25 workflows via Condor DAGMan. At any one time there are between 1 and 20 jobs running per workflow, and the load on the host (which is our Condor machine, GRAM server, and SGE submission host) is only slight. However, when I do:

  condor_q -long 6731

and look at a job that should have finished hours ago, I see the Globus URL is:

  https://sqwstage:59384/16217785859533749891/5764607537333819270/

So I check the status:

  globus-job-status https://sqwstage:59384/16217785859533749891/5764607537333819270/
  ACTIVE

So Globus (and Condor by extension) think the job is running. It shouldn't be, though, since this is a fast-running job. I then look at every job running on the SGE cluster:

  for i in `qstat | grep seqware | awk '{print $1}'`; do qstat -j $i | grep '16217785859533749891.5764607537333819270'; done

and I get nothing: the job is not running on the cluster. I think it ran and finished but Globus didn't pick up on this. I suppose it's possible the job never ran, but that seems unlikely because I think Globus would report PENDING in that case.

3) What do I do in this situation, when Globus loses track of a job's state in SGE? I hate to cancel the job and trigger Condor to re-run it, and the only alternative I can think of is the kind of cross-check script sketched below.
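For what it's worth, here is roughly what that cross-check would look like. This is just a sketch, not something I'm running: the script name, the SGE_USER variable, and the assumption that the two IDs from the GRAM contact URL show up somewhere in the qstat -j output (as they seem to in my grep above) are all mine.

  #!/bin/bash
  # check_lost_jobs.sh -- rough sketch only.
  # Read GRAM contact URLs from stdin, one per line.  For each contact that
  # globus-job-status reports as ACTIVE, grep the qstat -j output of every
  # SGE job owned by $SGE_USER for the two IDs embedded in the contact URL.
  # Contacts that claim ACTIVE but have no matching SGE job are printed as
  # suspects for manual follow-up (nothing is cancelled automatically).

  SGE_USER=${SGE_USER:-seqware}   # assumption: all grid jobs run as this user

  while read -r contact; do
      [ -z "$contact" ] && continue

      state=$(globus-job-status "$contact" 2>/dev/null)
      [ "$state" != "ACTIVE" ] && continue

      # e.g. https://sqwstage:59384/16217785859533749891/5764607537333819270/
      #   -> 16217785859533749891.5764607537333819270
      ids=$(echo "$contact" | awk -F/ '{print $4"."$5}')

      found=no
      for jid in $(qstat -u "$SGE_USER" | awk 'NR>2 {print $1}'); do
          if qstat -j "$jid" 2>/dev/null | grep -q "$ids"; then
              found=yes
              break
          fi
      done

      if [ "$found" = "no" ]; then
          echo "SUSPECT: $contact reports ACTIVE but has no matching SGE job"
      fi
  done

I could feed it contact URLs pulled out of condor_q -long, or by hand for a single job, e.g.:

  echo "https://sqwstage:59384/16217785859533749891/5764607537333819270/" | ./check_lost_jobs.sh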
Is there a way to request that the globus-job-manager re-check this job and, if it's not on the cluster, mark it as done? I could run something like the sketch above against all of my running Condor/Globus jobs and flag the ones that report ACTIVE but aren't actually on the cluster, but that feels like a total hack. These all sound like horrible options to me.

So my broader question is: why is this happening at all? I thought the whole point of using Globus here was being able to submit jobs to a scheduler without having to worry about issues like this. Are my expectations too high? Is it normal for jobs to be lost track of? Is it expected that I'll have to monitor jobs across the various layers and manually intervene?

Thanks very much for any guidance, suggestions, hints, or comments you can provide. Any tips that help us get the software stack stable will be very welcome!

--Brian O'Connor
