Just to follow up on this, here is some additional evidence:
Globus GRAM debugging:
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor
[SEG-sge-Thread,addEvent:523] JSM receiving scheduler event 1708 [Mon Sep
06 19:49:57 EDT 2010] Done
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGeneratorMonitor
[SEG-sge-Thread,addEvent:534] Dispatching event 1708 to job
26b12c40-ba11-11df-a8c7-93aa2282a0f7
2010-09-06T19:50:00.980-04:00 DEBUG utils.GramExecutorService
[SEG-sge-Thread,execute:52] # tasks: 0
2010-09-06T19:50:00.980-04:00 DEBUG seg.SchedulerEventGenerator
[SEG-sge-Thread,run:171] seg input line: 001;1283816997;1708;8;0
2010-09-06T19:50:00.980-04:00 DEBUG exec.ManagedExecutableJobHome
[pool-2-thread-1,jobStateChanged:399] Receiving jobStateChange event for
resource key {
http://www.globus.org/namespaces/2008/03/gram/job}ResourceID=26b12c40-ba11-11df-a8c7-93aa2282a0f7 with:
timestamp Mon Sep 06 19:49:57 EDT 2010
(new) state Done
exitCode 0
The "seg input line", above, occurs in the SGE reporting file here:
1283816997:job_log:1283816997:deleted:1708:1:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
deleted by schedd
1283816997:job_log:1283816997:deleted:1708:7:NONE:T:scheduler:topaz.si.edu:0:1024:1283816885:sge_job_script.38191:globus:staff::defaultdepartment:sge:job
deleted by schedd
(These are the only two lines in the file with this timestamp; they happen to
be the first 2 of the 8 tasks in the array to finish.)
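
In case it is useful, here is a quick sketch of how those fields could be
pulled apart. I'm assuming, just from eyeballing the two lines above, that
field 5 is the job number and field 6 the array task number, which would make
these entries tasks 1 and 7 of job 1708; please correct me if I've misread the
reporting-file format. Feed it the reporting file on stdin:

# parse_reporting.py -- rough throwaway sketch, not part of Globus.
# Assumes colon-delimited job_log records where field 5 is the job number
# and field 6 is the array task number (my reading of the lines above).
import sys

def job_log_fields(line):
    fields = line.rstrip("\n").split(":")
    if len(fields) > 5 and fields[1] == "job_log":
        # (event time, event, job number, task number)
        return fields[0], fields[3], fields[4], fields[5]
    return None

for line in sys.stdin:
    rec = job_log_fields(line)
    if rec:
        when, event, job, task = rec
        print("job %s task %s: %s at %s" % (job, task, event, when))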
It's pretty clear that as soon as any task in the array finishes, Globus thinks
the whole batch is done and goes ahead with the subsequent processing stages
(e.g., MergeStdout, StageOut). I'm guessing the place to fix this is in the SEG
code; I'm willing to bet someone has already patched this. If so, would you
be willing to share?
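
In case it helps whoever looks at the SEG code, this is the sort of
bookkeeping I'd expect the fix to need. This is purely an illustration of the
logic, not actual Globus code: keep the set of outstanding array tasks per
job, tick them off as their terminal events arrive, and only raise the
job-level Done once the set is empty.

# Illustrative only, not the real SEG/JSM code: track per-task completion so
# that a job-level "Done" is only raised after every array task has finished.
class ArrayJobTracker(object):
    def __init__(self):
        self.pending = {}  # job id -> set of array task ids still outstanding

    def task_started(self, job_id, task_id):
        self.pending.setdefault(job_id, set()).add(task_id)

    def task_finished(self, job_id, task_id):
        """Return True only when this was the last outstanding task."""
        tasks = self.pending.get(job_id)
        if tasks is None:
            return True            # not an array job we are tracking
        tasks.discard(task_id)
        if tasks:
            return False           # other tasks still running, suppress Done
        del self.pending[job_id]
        return True

# With the 8-task array from the log above (job 1708):
tracker = ArrayJobTracker()
for t in range(1, 9):
    tracker.task_started("1708", str(t))
print(tracker.task_finished("1708", "1"))  # False, 7 tasks still outstanding
print(tracker.task_finished("1708", "7"))  # False, 6 tasks still outstanding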
thanks,
Adam
On Mon, Sep 6, 2010 at 5:17 PM, Adam Bazinet <[email protected]> wrote:
> Hi everyone,
>
> Was this issue ever resolved? It is affecting our Globus installation
> (4.2.1) and SGE cluster as well. Specifically, the job seems to enter the
> StageOut phase prematurely (say, when 6 of the 8 tasks in the array have
> completed). Any assistance is greatly appreciated.
>
> thanks,
> Adam
>
>
>
> On Tue, May 27, 2008 at 12:51 PM, Korambath, Prakashan
> <[email protected]>wrote:
>
>> Hi Martin,
>>
>> I am using gt4.0.6 on the client node. I didn't try with Fork. Let me
>> see how Fork behaves. Thanks.
>>
>> Prakashan
>>
>>
>>
>>
>> -----Original Message-----
>> From: Martin Feller [mailto:[email protected]]
>> Sent: Tue 5/27/2008 9:48 AM
>> To: Korambath, Prakashan
>> Cc: gt-user; Jin, Kejian; Korambath, Prakashan
>> Subject: Re: [gt-user] Globus GRAM reporting status for each task in a SGE
>> job-job array submission
>>
>> Prakashan:
>>
>> GRAM should send a Done notification when the last job is done, not when
>> the first job is done. I tried it here and it works as expected for me.
>> What GT version are you using?
>> This is probably not at all SGE related, but does it behave in the same
>> way
>> when you submit to, say, Fork instead of SGE?
>>
>> Martin
>>
>>
>> ----- Original Message -----
>> From: "Prakashan Korambath" <[email protected]>
>> To: "gt-user" <[email protected]>, "Kejian Jin" <[email protected]>,
>> "Prakashan Korambath" <[email protected]>
>> Sent: Monday, May 26, 2008 4:10:46 PM GMT -06:00 US/Canada Central
>> Subject: [gt-user] Globus GRAM reporting status for each task in a SGE
>> job-job array submission
>>
>> Hi,
>>
>> We noticed that the Globus GRAM status reporting service (e.g., globusrun-ws
>> -status -j job_epr) reports the status as 'Done' as soon as the first few
>> tasks in a job array (multi jobs) are completed. Is there a way to make it
>> wait until the last task in the job array is completed? It would be fine if
>> all tasks completed within a few seconds of each other, but in most cases
>> they do not, and Globus reports that the entire job is finished, presumably
>> based on its reading of the $SGE_ROOT/common/reporting file, while there are
>> still tasks waiting to run. If there were an option to query the status of
>> the last task in a job array, that would be nice. Thanks.
>>
>>
>> Prakashan Korambath
>>
>>
>