Adam,

Yes - please make these changes/patches available. We'd like to take a look to see what changes might be necessary for GRAM in GT5.
Thanks,
Stu

On Oct 1, 2010, at 11:06 AM, Adam Bazinet wrote:

> Dear Prakashan,
>
> Thanks for your reply. Just having confirmation that we were running into
> the same problem was very helpful.
>
> Our solution involves modifying both sge.pm and SGE's SEG to keep track,
> using auxiliary files, of how many sub-jobs in a batch have completed, and
> to send the final "Done" notification only when all sub-jobs have completed.
>
> If anyone is interested in having this, we could send you our code, which
> is basically working. It consists of a modified sge.pm, a modified SGE SEG,
> and a separate shell script.
>
> thanks,
> Adam
>
>
> On Tue, Sep 7, 2010 at 11:33 AM, Prakashan Korambath <[email protected]> wrote:
>
> Hi Adam,
>
> It has been almost two years since we looked at this problem, so I have to
> refresh my memory, but the issue is that it is difficult for the reporting
> file to tell GRAM that the last array task is done. Say you submit 500 array
> tasks: it is always possible that some of them (for example, 467 and 468)
> are still running when task 500 finishes, because some nodes may be
> overloaded, etc. So we can't simply watch the last array task id, even if
> that were possible.
>
> I think the best solution would be for GRAM to report the JOBID back to us,
> and for us to independently and periodically probe, using a Fork job
> submission, whether all tasks for that job id have completed. This would
> work for workflows.
>
> Prakashan
>
>
> Adam Bazinet wrote:
>
> Just to follow up on this, here is some additional evidence:
>
> Globus GRAM debugging:
>
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator [SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor [SEG-sge-Thread,addEvent:523] JSM receiving scheduler event 1708 [Mon Sep 06 19:49:57 EDT 2010] Done
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor [SEG-sge-Thread,addEvent:534] Dispatching event 1708 to job 26b12c40-ba11-11df-a8c7-93aa2282a0f7
> 2010-09-06T19:50:00.980-04:00 DEBUG utils.GramExecutorService [SEG-sge-Thread,execute:52] # tasks: 0
> 2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator [SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
> 2010-09-06T19:50:00.980-04:00 DEBUG exec.ManagedExecutableJobHome [pool-2-thread-1,jobStateChanged:399] Receiving jobStateChange event for resource key {http://www.globus.org/namespaces/2008/03/gram/job}ResourceID=26b12c40-ba11-11df-a8c7-93aa2282a0f7 with:
>   timestamp    Mon Sep 06 19:49:57 EDT 2010
>   (new) state  Done
>   exitCode     0
>
> The "seg input line" above corresponds to these entries in the SGE reporting file:
>
> 1283816997:job_log:1283816997:deleted:1708:1:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job deleted by schedd
> 1283816997:job_log:1283816997:deleted:1708:7:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job deleted by schedd
>
> (These are the only two lines in the file with this timestamp; they happen
> to be the first 2 of the 8 jobs in the batch to finish.)
>
> It's pretty clear that as soon as any sub-job finishes, Globus thinks the
> whole batch is done and goes ahead with subsequent processing stages (e.g.,
> MergeStdout, StageOut). I'm guessing the place to fix this is in the SEG
> code; I'm willing to bet someone has already patched this. If so, would you
> be willing to share?
>
> thanks,
> Adam
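The approach Adam describes amounts to counting per-task "Done" events from the SEG and only forwarding a terminal event to GRAM once the count reaches the size of the task array. The following is a minimal Python sketch of that counting logic, not the actual sge.pm/SEG patch; the SEG line layout (version;timestamp;jobid;state;exit_code, with state 8 meaning Done) is taken from the debug log above, while the function names, the fixed batch size, and the in-memory counter (the real patch reportedly uses auxiliary files) are assumptions for illustration only.

    #!/usr/bin/env python
    # Sketch only: suppress per-task Done events for an SGE array job and emit
    # a single terminal event once every sub-job has reported. SEG lines look
    # like
    #   001;<timestamp>;<jobid>;<state>;<exit_code>
    # as in the GRAM debug log above, where state 8 is "Done".

    import sys

    STATE_DONE = 8

    def filter_events(lines, expected_tasks):
        """Pass non-terminal events through; hold per-task Done events until
        expected_tasks of them have been seen for a given job id."""
        done_counts = {}  # jobid -> number of sub-jobs seen finishing
        for line in lines:
            version, timestamp, jobid, state, exit_code = line.strip().split(";")
            if int(state) != STATE_DONE:
                yield line  # pending/active/failed events pass through unchanged
                continue
            done_counts[jobid] = done_counts.get(jobid, 0) + 1
            if done_counts[jobid] == expected_tasks:
                yield line  # all sub-jobs finished: forward one Done event

    if __name__ == "__main__":
        # The batch in the log excerpt above had 8 sub-jobs.
        for event in filter_events(sys.stdin, expected_tasks=8):
            sys.stdout.write(event)

Persisting the per-job count in auxiliary files rather than in memory, as Adam's patch does, would also let the count survive a SEG restart.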
>
> On Mon, Sep 6, 2010 at 5:17 PM, Adam Bazinet <[email protected]> wrote:
>
> Hi everyone,
>
> Was this issue ever resolved? It is affecting our Globus installation
> (4.2.1) and SGE cluster as well. Specifically, the job seems to enter the
> StageOut phase prematurely (say, when 6 of 8 jobs in a task array have
> completed). Any assistance is greatly appreciated.
>
> thanks,
> Adam
>
>
> On Tue, May 27, 2008 at 12:51 PM, Korambath, Prakashan <[email protected]> wrote:
>
> Hi Martin,
>
> I am using GT 4.0.6 on the client node. I haven't tried with Fork yet; let
> me see how Fork behaves. Thanks.
>
> Prakashan
>
>
> -----Original Message-----
> From: Martin Feller [mailto:[email protected]]
> Sent: Tue 5/27/2008 9:48 AM
> To: Korambath, Prakashan
> Cc: gt-user; Jin, Kejian; Korambath, Prakashan
> Subject: Re: [gt-user] Globus GRAM reporting status for each task in a SGE job-job array submission
>
> Prakashan:
>
> GRAM should send a Done notification when the last job is done, not when
> the first job is done. I tried it here and it works as expected for me.
> What GT version are you using?
> This is probably not SGE related at all, but does it behave the same way
> when you submit to, say, Fork instead of SGE?
>
> Martin
>
>
> ----- Original Message -----
> From: "Prakashan Korambath" <[email protected]>
> To: "gt-user" <[email protected]>, "Kejian Jin" <[email protected]>, "Prakashan Korambath" <[email protected]>
> Sent: Monday, May 26, 2008 4:10:46 PM GMT -06:00 US/Canada Central
> Subject: [gt-user] Globus GRAM reporting status for each task in a SGE job-job array submission
>
> Hi,
>
> We noticed that the Globus GRAM status reporting service (e.g., globusrun-ws
> -status -j job_epr) reports the status as 'Done' as soon as the first few
> tasks in a job array (multi-job submission) have completed. Is there a way
> to make it wait until the last task in the job array is completed? It would
> be fine if all tasks finished within a few seconds of each other, but in
> most cases they do not, and Globus reports the entire job as finished,
> presumably based on the $SGE_ROOT/common/reporting file, while there are
> still tasks waiting to run. An option to query the status of the last task
> in a job array would be nice. Thanks.
>
>
> Prakashan Korambath
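Prakashan's suggestion above, having GRAM hand back the SGE JOBID so the client can poll independently until every task of the array has finished, can be prototyped outside GRAM entirely. Below is a minimal Python sketch of such a poller; it is not part of any Globus component, and it assumes qstat is on the PATH and that a non-zero exit status from 'qstat -j <jobid>' means the job (all of its tasks) has left the queue, which should be verified on your SGE installation.

    #!/usr/bin/env python
    # Sketch only: poll SGE until no task of the given array job is still
    # queued or running, then treat the submission as done.

    import subprocess
    import time

    def array_job_finished(jobid):
        """Return True once no task of the given SGE job is known to qstat."""
        result = subprocess.run(
            ["qstat", "-j", str(jobid)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Assumption: qstat -j exits non-zero when the job id is unknown,
        # i.e. every task has finished and left the queue.
        return result.returncode != 0

    def wait_for_array_job(jobid, poll_seconds=60):
        """Block until every task of the array job has finished."""
        while not array_job_finished(jobid):
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        # e.g. job id 1708 from the reporting-file excerpt above
        wait_for_array_job(1708)
        print("all tasks of job 1708 have completed")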
