Dear Adam,
Glad to hear that you resolved the problem. Please post your solution
to the mailing list. Thanks.
Prakashan
On 10/01/2010 09:06 AM, Adam Bazinet wrote:
Dear Prakashan,
Thanks for your reply. Just confirming that we were running into the same
problem was very helpful.
Our solution involves modifying both sge.pm and SGE's SEG to keep track,
using auxiliary files, of how many sub-jobs in a batch have completed, and to only send the
final "Done" notification when all sub-jobs have completed.
If anyone is interested in having this, we could send you our code, which is
basically working. It consists of a modified sge.pm, a modified SGE SEG,
and a separate shell script.
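In the meantime, here is a very rough sketch (in Python, purely for illustration) of the
counting idea; the real patch is Perl and C, and the directory, file, and function names
below are made up rather than taken from our code:

    # Rough sketch of counting completed sub-jobs before emitting a single Done.
    # The auxiliary-file location and the emit_done hook are hypothetical.
    import os

    AUX_DIR = "/var/tmp/gram_sge_aux"       # hypothetical auxiliary-file directory

    def record_task_done(array_job_id, total_tasks):
        """Record one finished sub-job; return True only when the whole array is done."""
        if not os.path.isdir(AUX_DIR):
            os.makedirs(AUX_DIR)
        counter = os.path.join(AUX_DIR, "%s.done" % array_job_id)
        with open(counter, "a+") as f:
            f.write("done\n")               # one line per completed sub-job
            f.seek(0)                       # seek flushes the write, then re-read
            finished = sum(1 for _ in f)    # how many sub-jobs have finished so far
        return finished >= total_tasks

    def maybe_emit_done(array_job_id, total_tasks, emit_done):
        """Swallow per-task Done events; emit one Done for the whole array."""
        if record_task_done(array_job_id, total_tasks):
            emit_done(array_job_id)         # fires once, when the last sub-job finishes

The real code also has to discover the array size (e.g., from the -t range given to qsub)
and clean up the auxiliary files afterwards.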
thanks,
Adam
On Tue, Sep 7, 2010 at 11:33 AM, Prakashan
Korambath <[email protected]> wrote:
Hi Adam,
It has been almost two years since we looked at this problem, so I have to
refresh my memory, but the problem is that it may be difficult for the reporting
file to tell GRAM that the last array job is done. Say you submit 500 array
jobs: it is always possible that some of them, for example 467 and 468,
are still running when job 500 finishes, because some nodes may
be overloaded, etc. So we can't just look at the last array job id, even if that
were possible.
I think the best solution would be for GRAM to report the JOBID back to us, and
for us to periodically and independently probe, using a Fork job submission,
whether all jobs for that jobid have completed. This would work for workflows.
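Roughly something like the following sketch (Python here just to show the idea; it assumes
"qstat -j <jobid>" exits non-zero once SGE no longer knows the job, i.e. once every task of
the array has left the queue, and the probe itself would be submitted through the Fork
factory):

    # Periodically probe SGE until every task of an array job is gone.
    # Assumes "qstat -j <jobid>" fails once the whole array has finished;
    # adjust the probe if that does not hold at your site.
    import os
    import subprocess
    import time

    def array_job_finished(job_id):
        devnull = open(os.devnull, "w")
        try:
            rc = subprocess.call(["qstat", "-j", str(job_id)],
                                 stdout=devnull, stderr=devnull)
        finally:
            devnull.close()
        return rc != 0                   # non-zero: SGE no longer knows the job

    def wait_for_array(job_id, poll_seconds=60):
        while not array_job_finished(job_id):
            time.sleep(poll_seconds)     # keep probing until the last task is done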
Prakashan
Adam Bazinet wrote:
Just to follow up on this, here is some additional evidence:
Globus GRAM debugging:
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor
[SEG-sge-Thread,addEvent:523] JSM receiving scheduler event 1708 [Mon Sep 06
19:49:57 EDT 2010] Done
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor
[SEG-sge-Thread,addEvent:534] Dispatching event 1708 to job
26b12c40-ba11-11df-a8c7-93aa2282a0f7
2010-09-06T19:50:00.980-04:00 DEBUG utils.GramExecutorService
[SEG-sge-Thread,execute:52] # tasks: 0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG exec.ManagedExecutableJobHome
[pool-2-thread-1,jobStateChanged:399] Receiving jobStateChange event for resource key
{http://www.globus.org/namespaces/2008/03/gram/job}ResourceID=26b12c40-ba11-11df-a8c7-93aa2282a0f7
with:
timestamp Mon Sep 06 19:49:57 EDT 2010
(new) state Done
exitCode 0
The "seg input line", above, occurs in the SGE reporting file here:
1283816997:job_log:1283816997:deleted:1708:1:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
deleted by schedd
1283816997:job_log:1283816997:deleted:1708:7:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
deleted by schedd
(these are the only two lines in the file with this timestamp; they happen to be the
first two of the eight jobs in the batch to finish)
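Incidentally, the reporting file is colon-delimited; splitting the two lines above, the job
number (1708) appears to be the fifth field and the task id (1 and 7) the sixth, though I am
inferring that from these samples rather than from the SGE documentation. A quick parse,
in Python just for illustration:

    # Quick parse of a job_log line from $SGE_ROOT/common/reporting.
    # Field positions are guessed from the two sample lines above, not from
    # any SGE specification, so treat them as assumptions.
    def parse_job_log(line):
        fields = line.split(":")
        return {
            "timestamp": int(fields[0]),
            "event":     fields[3],       # e.g. "deleted"
            "job_id":    int(fields[4]),  # 1708 in the lines above
            "task_id":   fields[5],       # 1 and 7 here, i.e. individual array tasks
        }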
It's pretty clear that as soon as any sub-job finishes, Globus thinks the whole
batch is done and goes ahead with subsequent processing stages (e.g.,
MergeStdout, StageOut). I'm guessing the place to fix this is in the SEG code;
I'm willing to bet someone has already patched this. If so, would you be
willing to share?
thanks,
Adam
On Mon, Sep 6, 2010 at 5:17 PM, Adam
Bazinet <[email protected]>
wrote:
Hi everyone,
Was this issue ever resolved? It is affecting our Globus installation (4.2.1)
and SGE cluster as well. Specifically, the job seems to enter the StageOut
phase prematurely (say, when 6 of 8 jobs in a task array have completed). Any
assistance is greatly appreciated.
thanks,
Adam
On Tue, May 27, 2008 at 12:51 PM, Korambath,
Prakashan <[email protected]>
wrote:
Hi Martin,
I am using gt4.0.6 on the client node. I didn't try with Fork. Let me see
how Fork behaves. Thanks.
Prakashan
-----Original Message-----
From: Martin Feller [mailto:[email protected]]
Sent: Tue 5/27/2008 9:48 AM
To: Korambath, Prakashan
Cc: gt-user; Jin, Kejian; Korambath, Prakashan
Subject: Re: [gt-user] Globus GRAM reporting status for each task in a SGE
job-job array submission
Prakashan:
GRAM should send a Done notification when the last job is done, not when
the first job is done. I tried it here and it works as expected for me.
What GT version are you using?
This is probably not at all SGE related, but does it behave in the same way
when you submit to, say, Fork instead of SGE?
Martin
----- Original Message -----
From: "Prakashan
Korambath"<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
To: "gt-user"<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>, "Kejian
Jin"<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>, "Prakashan
Korambath"<[email protected]<mailto:[email protected]><mailto:[email protected]<mailto:[email protected]>>>
Sent: Monday, May 26, 2008 4:10:46 PM GMT -06:00 US/Canada Central
Subject: [gt-user] Globus GRAM reporting status for each task in a SGE job-job
array submission
Hi,
We noticed that the Globus GRAM status reporting service (e.g., globusrun-ws -status
-j job_epr) reports the status as 'Done' as soon as the first few tasks in a
job array (multiple jobs) are completed. Is there a way to make it wait until the
last task in the job array is completed? It would be fine if all tasks completed
within a few seconds of one another, but in most cases they do not, and Globus
reports that the entire job is finished, presumably based on what it reads from the
$SGE_ROOT/common/reporting file, while there are still tasks waiting to run.
An option to query the status of the last task in a job array
would be nice. Thanks.
Prakashan Korambath