Hi,
Thanks for your reply.

I think the state information for each job must be stored somewhere. But
where and how is that information stored, and how can it be retrieved?

> Hi Tonny,
>
> GRAM is fault tolerant, meaning that when/if the container or service
> host crashes, the job details are not lost.  When the GRAM4 service is
> restarted, then the processing/monitoring of the job resumes.  GRAM2
> requires user/client intervention to restart the processing of the job.
>
> If the job included file stage-in directives that had not completed
> at the time of the crash, then GRAM will resume processing the job
> from that job state and continue until the job has been fully
> processed.
>
> If the job had already been submitted to the local resource manager,
> then GRAM will resume monitoring the job in the LRM and continue
> processing the job to completion.  GRAM persists the LRM job id.  If
> the crash included the LRM and the LRM is also fault tolerant and
> resumes processing of the job, then the job will be completely
> processed without requiring any client intervention.
>
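
To make the recovery behaviour described above concrete, here is a minimal sketch of the general idea: persist each job's state and LRM job id, then reload them after a restart. This is not GRAM's actual persistence code or file format. The JobStore class, the resume_action function, the JSON layout, and the state names are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


class JobStore:
    """Hypothetical on-disk job store; not GRAM's real persistence format."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, job_id):
        return os.path.join(self.directory, job_id + ".json")

    def save(self, job_id, state, lrm_job_id=None):
        record = {"job_id": job_id, "state": state, "lrm_job_id": lrm_job_id}
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write cannot leave a half-written record.
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
        os.replace(tmp, self._path(job_id))

    def load_all(self):
        records = []
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".json"):
                with open(os.path.join(self.directory, name)) as f:
                    records.append(json.load(f))
        return records


def resume_action(record):
    """Decide what a restarted service would do for one persisted job."""
    if record["state"] in ("Done", "Failed"):
        return "nothing to do"
    if record["lrm_job_id"] is not None:
        # The job already reached the LRM: pick up monitoring by its id.
        return "resume monitoring LRM job " + record["lrm_job_id"]
    # Otherwise continue from the last recorded state (e.g. staging in).
    return "continue processing from state " + record["state"]
```

After a simulated restart, creating a new JobStore on the same directory and calling resume_action on each record from load_all() yields the per-job recovery step. The point is only that the LRM job id has to be written durably before the crash for monitoring to resume.
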
> A persistent connection between the GRAM client and service is not
> maintained, so network failures between the client and service can be
> overcome.
>
> In GRAM4 (WS GRAM), an EPR is included in the reply to
> createManagedJob.  This allows the client to contact the service when
> desired to get the current job status, cancel the job, subscribe for
> notifications, ...
>
> If the createManagedJob call is received by the GRAM service, but the
> reply (containing the EPR) is not received by the client (possibly due
> to network failure), then GRAM4 provides the means to subsequently get
> the EPR in order to control the previously submitted job.
> Details about that are here:
> http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-submissionid
>
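
The lost-reply case above can be sketched as an idempotent submission. This assumes the mechanism behind the link is a submission id chosen by the client, which the link's anchor suggests but this thread does not confirm. FakeJobService, create_job, and submit_with_retry are hypothetical stand-ins and not the WS GRAM API. The returned string plays the role of the EPR.

```python
import uuid


class FakeJobService:
    """In-memory stand-in for a job service; NOT the WS GRAM API."""

    def __init__(self):
        self._jobs = {}
        self.drop_next_reply = False  # set to True to simulate a lost reply

    def create_job(self, submission_id, description):
        # A repeated submission id maps to the job created the first time,
        # so retrying never starts a second copy of the job.
        if submission_id not in self._jobs:
            self._jobs[submission_id] = "job-handle-%d" % (len(self._jobs) + 1)
        if self.drop_next_reply:
            # The job exists on the service, but the client never hears back.
            self.drop_next_reply = False
            raise ConnectionError("reply lost")
        return self._jobs[submission_id]

    def job_count(self):
        return len(self._jobs)


def submit_with_retry(service, description, attempts=3):
    # The id is chosen by the client before the first attempt, so it
    # survives a lost reply and can be reused on every retry.
    submission_id = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return service.create_job(submission_id, description)
        except ConnectionError:
            continue
    raise RuntimeError("could not obtain a job handle")
```

If the first reply is dropped, the retry with the same submission id returns the handle of the job that was already created, and the service still holds only one job.
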
> Lemme know if you have any more questions on this.
>
> Regards,
> -Stu
>
> On Jun 19, 2008, at 10:00 AM, [EMAIL PROTECTED] wrote:
>
>>
>> Hi,
>>
>> I don't quite understand how GT4 manages a submitted job when
>> failures happen, for example a lost connection with the client caused
>> by a temporary network failure, or lost contact with the LRM caused by
>> Globus being restarted during job execution.
>>
>> Does anybody know about this?
>>
>> Regards
>>
>> Tonny
>>
>
>

