Hi Tonny,
GRAM is fault tolerant: if the container or service host crashes, the
job details are not lost. When the GRAM4 service is restarted, the
processing and monitoring of the job resume. GRAM2, by contrast,
requires user/client intervention to restart the processing of the job.
If the job included file stage-in directives and those had not
completed at the time of the crash, then GRAM will resume processing
the job from that job state and continue until the job has been fully
processed.
If the job had already been submitted to the local resource manager
(LRM), then GRAM will resume monitoring the job in the LRM and continue
processing the job to completion. GRAM persists the LRM job ID. If
the crash included the LRM and the LRM is also fault tolerant and
resumes processing of the job, then the job will be completely
processed without requiring any client intervention.
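As a rough illustration only (this is not GRAM code), the recovery
behaviour described above can be sketched in Python. The state names,
the JSON file store, and the submit_to_lrm/poll_lrm callbacks are all
hypothetical; the point is just that each job's state and LRM job ID
are written to disk, so a restarted service can pick up where it left
off.

```python
import json
import os

# Hypothetical job states, loosely following the stages described above.
STAGE_IN, SUBMITTED, DONE = "StageIn", "Submitted", "Done"


class JobStore:
    """Toy persistent store: one JSON file holding every job record."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def save(self, jobs):
        # Write to a temp file and rename it, so a crash in the middle
        # of a write does not corrupt the existing records.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(jobs, f)
        os.replace(tmp, self.path)


def recover(store, submit_to_lrm, poll_lrm):
    """After a restart, resume every job from its persisted state."""
    jobs = store.load()
    for job_id, job in jobs.items():
        if job["state"] == STAGE_IN:
            # Staging was interrupted: redo that step, then submit.
            job["lrm_id"] = submit_to_lrm(job_id)
            job["state"] = SUBMITTED
            store.save(jobs)
        if job["state"] == SUBMITTED:
            # The persisted LRM job ID lets monitoring resume.
            if poll_lrm(job["lrm_id"]) == "finished":
                job["state"] = DONE
                store.save(jobs)
    return jobs
```

A real service would keep polling the LRM until the job finishes; the
sketch checks once to stay short.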
A persistent connection between the GRAM client and service is not
maintained, so network failures between the client and service can be
overcome.
In GRAM4 (WS GRAM), an EPR is included in the reply to
createManagedJob. This allows the client to contact the service when
desired to get the current job status, cancel the job, subscribe for
notifications, ...
If the createManagedJob call is received by the GRAM service, but the
reply (containing the EPR) is not received by the client (possibly due
to network failure), then GRAM4 provides the means to subsequently get
the EPR in order to control the previously submitted job.
Details about that are here:
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-submissionid
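To give a feel for the idea, here is a small Python sketch. It assumes
the mechanism is a client-chosen submission ID (as the link's anchor
suggests) that the service uses to recognise a repeated request. The
ToyJobService class and submit_with_retry function are made up for
illustration and are not the GRAM4 API; the returned handle stands in
for the EPR.

```python
import uuid


class ToyJobService:
    """In-memory stand-in for a job service (hypothetical, not GRAM4)."""

    def __init__(self):
        self._handles = {}  # submission ID -> job handle
        self._counter = 0

    def create_job(self, submission_id, description):
        # A repeated submission ID returns the existing handle instead
        # of starting a second copy of the job.
        if submission_id in self._handles:
            return self._handles[submission_id]
        self._counter += 1
        handle = "job-%d" % self._counter
        self._handles[submission_id] = handle
        return handle


def submit_with_retry(service, description, lose_first_reply=False):
    # The client picks the ID before the first attempt, so it still
    # knows the ID even if the reply never arrives.
    submission_id = str(uuid.uuid4())
    handle = service.create_job(submission_id, description)
    if lose_first_reply:
        # Simulate a lost reply: resend the request with the same ID.
        handle = service.create_job(submission_id, description)
    return submission_id, handle
```

With this scheme, resending after a lost reply is safe: the service
hands back the handle for the job it already created.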
Lemme know if you have any more questions on this.
Regards,
-Stu
On Jun 19, 2008, at 10:00 AM, [EMAIL PROTECTED] wrote:
Hi,
I don't quite understand how GT4 manages a submitted job when
failures happen, for example a lost connection with the client caused
by a temporary network failure, or lost contact with the LRM caused by
Globus being restarted during job execution.
Does anybody know about this?
Regards
Tonny