OK, thanks for your suggestion. In addition to this server-side solution
(fixing the jobmanager), we've also created a client-side solution. We
poll the job state until the following holds:
gramJob.getError() == GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE
Then we remove the directory in which the job ran. This works
perfectly, because at that moment the cancellation is completely done. So
if anyone runs into this problem again, there's a way out of it ;-)
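For anyone wanting to replicate this, a minimal sketch of that polling loop follows. To keep it self-contained, the Globus-specific check (gramJob.getError() against GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE) is abstracted behind an IntSupplier; the constant 79 in the demo is only a placeholder for the real error code, and the helper name waitForError, the poll interval, and the attempt limit are all my own assumptions, not part of the CoG API.

```java
import java.util.function.IntSupplier;

public class CancelPoller {
    /**
     * Polls errorSupplier until it returns targetError or maxAttempts
     * runs out. In the real client, errorSupplier would wrap
     * gramJob.getError() and targetError would be
     * GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE.
     * Returns true if the target error was seen (safe to delete the
     * job's working directory), false if we gave up.
     */
    public static boolean waitForError(IntSupplier errorSupplier, int targetError,
                                       int maxAttempts, long sleepMs) {
        for (int i = 0; i < maxAttempts; i++) {
            if (errorSupplier.getAsInt() == targetError) {
                return true;  // cancellation fully done
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;  // do NOT delete the directory yet
    }

    public static void main(String[] args) {
        // Simulated job: reports error code 0 twice, then 79
        // (79 is just a stand-in value for the demo).
        int[] calls = {0};
        IntSupplier fakeJob = () -> (++calls[0] < 3) ? 0 : 79;
        boolean done = waitForError(fakeJob, 79, 10, 1);
        System.out.println(done);  // prints "true"
        // Only now would the real client delete the job's run directory.
    }
}
```

The point of the pattern is simply that the directory removal happens after the poll loop confirms the terminal error code, never directly after GramJob.cancel() returns.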
Cheers,
Roelof Kemp
Stuart Martin wrote:
Hi Roelof,
I think the root issue here is that GRAM will cancel the SGE job and
that may take some time for the job to actually be cancelled.
Currently, GRAM does not monitor a job after cancellation. GRAM does
not wait until the job is no longer visible / leaves the LRM queue.
GRAM essentially triggers the cancellation in the LRM.
So there could be a timing issue here when cleaning up a job's files if
the LRM's job cancellation is slow.
You could try modifying the sge.pm cancel() subroutine to loop and
occasionally check if the job has been removed before returning. The
GramJob.cancel() call will block waiting for a reply, so delaying the
remote cancel will result in knowing for sure that the job has left
the SGE queue once the call completes.
An enhancement for a more reliable cancel operation across all
LRMs probably makes sense. I'm surprised this has not come up
before; maybe LRMs typically cancel jobs quickly enough.
-Stu
On Mar 7, 2008, at 8:41 AM, Roelof Kemp wrote:
Hi all,
I have a situation in which I submit a GramJob to an SGE jobmanager
and run it in a specific directory (which I create just before
submitting the job). Sometimes I want to cancel the job, and to do so
I call the method GramJob.cancel(). Directly after this call I remove
the directory the job ran in. This sometimes leads to a
situation where the directory is indeed deleted, but the job keeps
running. The logging of the SGE jobmanager tells me this:
03/06/2008 17:23:28|qmaster|fs0|W|job 176632.1 failed on host xxx
general opening input/output file because: 03/06/2008 17:23:28
[1001:23735]: error: can't open output file
"xxx/17249.1204820556/stdout": Stale NFS file handle
The stale NFS file handle is probably the reason the job isn't
properly cancelled, and I understand that the cancellation takes some
time and that I have to wait for it before deleting the directory. Is
there any way to know when GramJob.cancel() has completed? Can I catch
a status change in the handleStatusChange method (I do implement
GramJobListener)? Which status indicates a successful cancellation?
Or can I somehow poll to find out whether the cancellation is done?
Can anyone help me with this?
Thanks in advance,
Roelof