Hi,

Thanks for your reply. I think there must be some place where each job's state information is stored. But where is that information kept, and how can it be stored or retrieved?

> Hi Tonny,
>
> GRAM is fault tolerant, meaning that when/if the container or service
> host crashes, the job details are not lost. When the GRAM4 service is
> restarted, then the processing/monitoring of the job resumes. GRAM2
> requires user/client intervention to restart the processing of the job.
>
> If the job included file stage in directives and those had not
> completed at the time of the crash, then gram will resume processing
> the job for that job state and continue until the job has been fully
> processed.
>
> If the job had already been submitted to the local resource manager,
> then GRAM will resume monitoring the job in the LRM and continue
> processing the job to completion. GRAM persists the LRM job id. If
> the crash included the LRM and the LRM is also fault tolerant and
> resumes processing of the job, then the job will be completely
> processed without requiring any client intervention.
>
> A persistent connection between the GRAM client and service is not
> maintained, so network failures between the client and service can be
> overcome.
>
> In GRAM4 (WS GRAM), an EPR is included in the reply to
> createManagedJob. This allows the client to contact the service when
> desired to get the current job status, cancel the job, subscribe for
> notifications, ...
>
> If the createManagedJob call is received by the GRAM service, but the
> reply (containing the EPR) is not received by the client (possibly due
> to network failure), then GRAM4 provides the means to subsequently get
> the EPR in order to control the previously submitted job.
> Detail about that are here:
> http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/user-index.html#s-wsgram-user-submissionid
>
> Lemme know if you have any more questions on this.
>
> Regards,
> -Stu
>
> On Jun 19, 2008, at Jun 19, 10:00 AM, [EMAIL PROTECTED] wrote:
>
>> Hi,
>>
>> I'm not quite understand about how GT4 manages job that was submitted
>> when some failures happen, for example lost connection with client
>> that caused by temporary network failure and lost contact with LRM
>> that caused by globus being restarted during job execution.
>>
>> does anybody know about this ?
>>
>> Regards
>>
>> Tonny
