Adam,

Yes - please make these changes/patches available. We'd like to take a look to see what changes might be necessary for GRAM in GT5.
Thanks,
Stu

On Oct 1, 2010, at 11:06 AM, Adam Bazinet wrote:

> Dear Prakashan,
>
> Thanks for your reply. Just having confirmation that we were running into
> the same problem was very helpful.
>
> Our solution involves modifying both sge.pm and SGE's SEG to keep track,
> using auxiliary files, of how many sub-jobs in a batch have completed, and
> to send the final "Done" notification only when all sub-jobs have completed.
>
> If anyone is interested in having this, we could send you our code, which
> is basically working. It consists of a modified sge.pm, a modified SGE SEG,
> and a separate shell script.
>
> thanks,
> Adam
>
>
> On Tue, Sep 7, 2010 at 11:33 AM, Prakashan Korambath <[email protected]> wrote:
>
> Hi Adam,
>
> It has been almost two years since we looked at this problem, so I have to
> refresh my memory, but the issue is that it is difficult for the reporting
> file to tell GRAM that the last array task is done. Say you submit 500 array
> tasks: it is always possible that some of them (for example, 467 and 468)
> are still running when task 500 finishes, because some nodes may be
> overloaded, etc. So we can't simply watch the last array task id, even if
> that were possible.
>
> I think the best solution would be for GRAM to report the JOBID back to us,
> and for us to independently and periodically probe, using a Fork job
> submission, whether all tasks for that job id have completed. This would
> work for workflows.
>
> Prakashan
>
>
> Adam Bazinet wrote:
>
> Just to follow up on this, here is some additional evidence:
>
> Globus GRAM debugging:
>
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator [SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor [SEG-sge-Thread,addEvent:523] JSM receiving scheduler event 1708 [Mon Sep 06 19:49:57 EDT 2010] Done
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor [SEG-sge-Thread,addEvent:534] Dispatching event 1708 to job 26b12c40-ba11-11df-a8c7-93aa2282a0f7
> 2010-09-06T19:50:00.980-04:00 DEBUG utils.GramExecutorService [SEG-sge-Thread,execute:52] # tasks: 0
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator [SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
> 2010-09-06T19:50:00.980-04:00 DEBUG exec.ManagedExecutableJobHome [pool-2-thread-1,jobStateChanged:399] Receiving jobStateChange event for resource key {http://www.globus.org/namespaces/2008/03/gram/job}ResourceID=26b12c40-ba11-11df-a8c7-93aa2282a0f7 with:
>   timestamp    Mon Sep 06 19:49:57 EDT 2010
>   (new) state  Done
>   exitCode     0
>
> The "seg input line" above corresponds to these entries in the SGE reporting file:
>
> 1283816997:job_log:1283816997:deleted:1708:1:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job deleted by schedd
> 1283816997:job_log:1283816997:deleted:1708:7:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job deleted by schedd
>
> (These are the only two lines in the file with this timestamp; they happen
> to be the first 2 of the 8 jobs in the batch to finish.)
>
> It's pretty clear that as soon as any sub-job finishes, Globus thinks the
> whole batch is done and goes ahead with subsequent processing stages (e.g.,
> MergeStdout, StageOut). I'm guessing the place to fix this is in the SEG
> code; I'm willing to bet someone has already patched this. If so, would you
> be willing to share?
>
> thanks,
> Adam
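The approach Adam describes amounts to counting per-task "Done" events from the SEG and only forwarding a terminal event to GRAM once the count reaches the size of the task array. The following is a minimal Python sketch of that counting logic, not the actual sge.pm/SEG patch; the SEG line layout (version;timestamp;jobid;state;exit_code, with state 8 meaning Done) is taken from the debug log above, while the function names, the fixed batch size, and the in-memory counter (the real patch reportedly uses auxiliary files) are assumptions for illustration only.

    #!/usr/bin/env python
    # Sketch only: suppress per-task Done events for an SGE array job and emit
    # a single terminal event once every sub-job has reported. SEG lines look
    # like
    #   001;<timestamp>;<jobid>;<state>;<exit_code>
    # as in the GRAM debug log above, where state 8 is "Done".

    import sys

    STATE_DONE = 8

    def filter_events(lines, expected_tasks):
        """Pass non-terminal events through; hold per-task Done events until
        expected_tasks of them have been seen for a given job id."""
        done_counts = {}  # jobid -> number of sub-jobs seen finishing
        for line in lines:
            version, timestamp, jobid, state, exit_code = line.strip().split(";")
            if int(state) != STATE_DONE:
                yield line  # pending/active/failed events pass through unchanged
                continue
            done_counts[jobid] = done_counts.get(jobid, 0) + 1
            if done_counts[jobid] == expected_tasks:
                yield line  # all sub-jobs finished: forward one Done event

    if __name__ == "__main__":
        # The batch in the log excerpt above had 8 sub-jobs.
        for event in filter_events(sys.stdin, expected_tasks=8):
            sys.stdout.write(event)

Persisting the per-job count in auxiliary files rather than in memory, as Adam's patch does, would also let the count survive a SEG restart.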
>
> On Mon, Sep 6, 2010 at 5:17 PM, Adam Bazinet <[email protected]> wrote:
>
> Hi everyone,
>
> Was this issue ever resolved? It is affecting our Globus installation
> (4.2.1) and SGE cluster as well. Specifically, the job seems to enter the
> StageOut phase prematurely (say, when 6 of 8 jobs in a task array have
> completed). Any assistance is greatly appreciated.
>
> thanks,
> Adam
>
>
> On Tue, May 27, 2008 at 12:51 PM, Korambath, Prakashan <[email protected]> wrote:
>
> Hi Martin,
>
> I am using GT 4.0.6 on the client node. I haven't tried with Fork yet; let
> me see how Fork behaves. Thanks.
>
> Prakashan
>
>
> -----Original Message-----
> From: Martin Feller [mailto:[email protected]]
> Sent: Tue 5/27/2008 9:48 AM
> To: Korambath, Prakashan
> Cc: gt-user; Jin, Kejian; Korambath, Prakashan
> Subject: Re: [gt-user] Globus GRAM reporting status for each task in a SGE job-job array submission
>
> Prakashan:
>
> GRAM should send a Done notification when the last job is done, not when
> the first job is done. I tried it here and it works as expected for me.
> What GT version are you using?
> This is probably not SGE related at all, but does it behave the same way
> when you submit to, say, Fork instead of SGE?
>
> Martin
>
>
> ----- Original Message -----
> From: "Prakashan Korambath" <[email protected]>
> To: "gt-user" <[email protected]>, "Kejian Jin" <[email protected]>, "Prakashan Korambath" <[email protected]>
> Sent: Monday, May 26, 2008 4:10:46 PM GMT -06:00 US/Canada Central
> Subject: [gt-user] Globus GRAM reporting status for each task in a SGE job-job array submission
>
> Hi,
>
> We noticed that the Globus GRAM status reporting service (e.g., globusrun-ws
> -status -j job_epr) reports the status as 'Done' as soon as the first few
> tasks in a job array (multi-job submission) have completed. Is there a way
> to make it wait until the last task in the job array is completed? It would
> be fine if all tasks finished within a few seconds of each other, but in
> most cases they do not, and Globus reports the entire job as finished,
> presumably based on the $SGE_ROOT/common/reporting file, while there are
> still tasks waiting to run. An option to query the status of the last task
> in a job array would be nice. Thanks.
>
>
> Prakashan Korambath
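Prakashan's suggestion above, having GRAM hand back the SGE JOBID so the client can poll independently until every task of the array has finished, can be prototyped outside GRAM entirely. Below is a minimal Python sketch of such a poller; it is not part of any Globus component, and it assumes qstat is on the PATH and that a non-zero exit status from 'qstat -j <jobid>' means the job (all of its tasks) has left the queue, which should be verified on your SGE installation.

    #!/usr/bin/env python
    # Sketch only: poll SGE until no task of the given array job is still
    # queued or running, then treat the submission as done.

    import subprocess
    import time

    def array_job_finished(jobid):
        """Return True once no task of the given SGE job is known to qstat."""
        result = subprocess.run(
            ["qstat", "-j", str(jobid)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Assumption: qstat -j exits non-zero when the job id is unknown,
        # i.e. every task has finished and left the queue.
        return result.returncode != 0

    def wait_for_array_job(jobid, poll_seconds=60):
        """Block until every task of the array job has finished."""
        while not array_job_finished(jobid):
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        # e.g. job id 1708 from the reporting-file excerpt above
        wait_for_array_job(1708)
        print("all tasks of job 1708 have completed")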
