Dear Adam,

Glad to hear that you resolved the problem. Please post it to the mailing list. Thanks.

Prakashan


On 10/01/2010 09:06 AM, Adam Bazinet wrote:
Dear Prakashan,

Thanks for your reply.  Just confirming that we were running into the same 
problem was very helpful.

Our solution involves modifying both sge.pm and SGE's SEG to keep track, 
using auxiliary files, of how many sub-jobs in a batch have completed, and to send the 
final "Done" notification only when all sub-jobs have completed.
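Roughly, the bookkeeping works like this (a sketch only, in Python for illustration; our actual patch lives in sge.pm and the SEG, and the file name here is made up):

```python
import os

def record_subjob_done(counter_path, total_subjobs):
    """Increment the per-batch completion counter kept in an auxiliary
    file; return True only when the last sub-job of the batch has
    finished, i.e. only then should the "Done" notification be sent."""
    done = 0
    if os.path.exists(counter_path):
        with open(counter_path) as f:
            done = int(f.read().strip() or 0)
    done += 1
    with open(counter_path, "w") as f:
        f.write(str(done))
    return done >= total_subjobs
```

Only the call made for the final sub-job returns True, so intermediate sub-job completions never trigger the batch-level "Done".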

If anyone is interested in having this, we could send you our code, which is 
basically working.  It consists of a modified sge.pm, an SGE SEG, 
and a separate shell script.

thanks,
Adam



On Tue, Sep 7, 2010 at 11:33 AM, Prakashan Korambath <[email protected]> wrote:
Hi Adam,

It has been almost two years since we looked at this problem, so I have to 
refresh my memory, but the problem is that it may be difficult for the reporting 
file to tell GRAM that the last array job is done.  Say you submitted 500 array 
jobs; it is always possible that some of them, like 467 and 468 
(for example), may still be running when job 500 finishes, because some nodes may 
be overloaded, etc.  So we can't just look at the last array job ID, even if that 
were possible.

I think the best solution would be for GRAM to report the JOBID back to us, and 
for us to independently probe periodically, using a Fork job submission, whether 
all tasks for that job ID have completed.  This would work for workflows.
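The probe itself could be as simple as the following sketch (illustrative only; it assumes SGE's `qstat -j <jobid>` exits non-zero once the job no longer exists on the scheduler, and the probe is injectable so a site can substitute its own check):

```python
import subprocess
import time

def wait_for_array_job(jobid, probe=None, poll_seconds=60):
    """Block until the whole array job disappears from the scheduler.
    `probe` returns True while the job still exists; by default it
    shells out to `qstat -j <jobid>` and treats exit code 0 as
    "still known to the scheduler"."""
    if probe is None:
        probe = lambda: subprocess.call(
            ["qstat", "-j", str(jobid)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL) == 0
    while probe():
        time.sleep(poll_seconds)
```

Run periodically as a Fork job, this only returns once every task of the array job is gone, which is exactly the point at which a batch-level "Done" would be safe.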

Prakashan


Adam Bazinet wrote:
Just to follow up on this, here is some additional evidence:

Globus GRAM debugging:

2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator 
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor 
[SEG-sge-Thread,addEvent:523]  JSM receiving scheduler event 1708 [Mon Sep 06 
19:49:57 EDT 2010] Done
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor 
[SEG-sge-Thread,addEvent:534] Dispatching event 1708 to job 
26b12c40-ba11-11df-a8c7-93aa2282a0f7
2010-09-06T19:50:00.980-04:00 DEBUG utils.GramExecutorService 
[SEG-sge-Thread,execute:52] # tasks: 0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator 
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG exec.ManagedExecutableJobHome 
[pool-2-thread-1,jobStateChanged:399] Receiving jobStateChange event for resource key 
{http://www.globus.org/namespaces/2008/03/gram/job}ResourceID=26b12c40-ba11-11df-a8c7-93aa2282a0f7
  with:
timestamp Mon Sep 06 19:49:57 EDT 2010
(new) state Done
exitCode 0

The "seg input line", above, occurs in the SGE reporting file here:

1283816997:job_log:1283816997:deleted:1708:1:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
 deleted by schedd
1283816997:job_log:1283816997:deleted:1708:7:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
 deleted by schedd

(These are the only two lines in the file with this timestamp; they happen to be 
the first 2 of the 8 jobs in the batch to finish.)
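For reference, here is a rough sketch of how one could tally per-task completions from such reporting-file lines (field positions are inferred from the lines above; the real SEG is C, not Python, so this is only to illustrate the fix):

```python
def completed_tasks(reporting_lines, jobid):
    """Return the set of task IDs of `jobid` seen as deleted/finished.
    Fields are colon-separated: event name is field 4, job ID field 5,
    task ID field 6 (1-based), judging from the sample lines."""
    done = set()
    for line in reporting_lines:
        fields = line.split(":")
        if len(fields) > 5 and fields[3] == "deleted" and fields[4] == str(jobid):
            done.add(fields[5])
    return done
```

The SEG would then emit the Done event for the batch only once the size of this set reaches the total task count, instead of on the first matching line.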

It's pretty clear that as soon as any sub-job finishes, Globus thinks the whole 
batch is done and goes ahead with subsequent processing stages (e.g., 
MergeStdout, StageOut).  I'm guessing the place to fix this is in the SEG code; 
I'm willing to bet someone has already patched this.  If so, would you be 
willing to share?

thanks,
Adam



On Mon, Sep 6, 2010 at 5:17 PM, Adam Bazinet <[email protected]> wrote:
Hi everyone,

Was this issue ever resolved?  It is affecting our Globus installation (4.2.1) 
and SGE cluster as well.  Specifically, the job seems to enter the StageOut 
phase prematurely (say, when 6 of 8 jobs in a task array have completed).  Any 
assistance is greatly appreciated.

thanks,
Adam



On Tue, May 27, 2008 at 12:51 PM, Korambath, Prakashan <[email protected]> wrote:

Hi Martin,

I am using gt4.0.6 on the client node.  I didn't try with Fork.  Let me see 
how Fork behaves.  Thanks.

Prakashan




-----Original Message-----
From: Martin Feller [mailto:[email protected]]
Sent: Tue 5/27/2008 9:48 AM
To: Korambath, Prakashan
Cc: gt-user; Jin, Kejian; Korambath, Prakashan
Subject: Re: [gt-user] Globus GRAM reporting status for each task in a SGE 
job-job array submission

Prakashan:

GRAM should send a Done notification if the last job is done, and not when
the first job is done. I tried it here and it works as expected for me.
What GT version are you using?
This is probably not at all SGE related, but does it behave in the same way
when you submit to, say, Fork instead of SGE?

Martin


----- Original Message -----
From: "Prakashan Korambath" <[email protected]>
To: "gt-user" <[email protected]>, "Kejian Jin" <[email protected]>,
 "Prakashan Korambath" <[email protected]>
Sent: Monday, May 26, 2008 4:10:46 PM GMT -06:00 US/Canada Central
Subject: [gt-user] Globus GRAM reporting status for each task in a SGE job-job 
array submission




Hi,

We noticed that the Globus GRAM status reporting service (e.g., globusrun-ws -status 
-j job_epr) reports the status as 'Done' as soon as the first few tasks in a 
job array (multi-job) are completed. Is there a way to make it wait until the 
last task in the job array is completed? It would be fine if all tasks completed 
within a few seconds of each other, but in most cases they do not, and Globus 
reports the entire job as finished, perhaps based on reading the 
$SGE_ROOT/common/reporting file, while there are still tasks waiting to run. 
An option to query the status of the last task in a job array would be nice. 
Thanks.


Prakashan Korambath





