Ok, thanks for your suggestion. In addition to this server-side solution (fixing the jobmanager) we've also created a client-side solution. We poll for the job state until this holds:

gramJob.getError() == GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE

Then we remove the directory the job ran in. This works perfectly, because the cancellation is completely done at that moment. So if anyone runs into this problem again, there's a way out of it ;-)
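For anyone who wants to copy this, a rough Java sketch of the client-side loop is below. To keep it compilable on its own, the Globus classes are replaced by a minimal stub interface and a placeholder error constant; in real code you would call gramJob.getError() and compare it against GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE as described above.

```java
import java.util.concurrent.TimeUnit;

public class CancelAndCleanup {
    // Stand-in for org.globus.gram.GramJob (stub for illustration only).
    interface Job {
        int getError();
    }

    // Placeholder value; in real code use the GramError constant instead.
    static final int GRAM_JOBMANAGER_CONNECTION_FAILURE = 79;

    // Poll the job's error code until it signals that cancellation has
    // fully completed, or give up after timeoutMillis.
    static boolean waitForCancellation(Job job, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (job.getError() == GRAM_JOBMANAGER_CONNECTION_FAILURE) {
                return true;  // cancellation done; safe to remove the directory
            }
            TimeUnit.MILLISECONDS.sleep(500); // back off between polls
        }
        return false;  // timed out; do NOT delete the directory yet
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a job whose error code flips after a few polls.
        Job job = new Job() {
            private int polls = 0;
            public int getError() {
                return ++polls < 3 ? 0 : GRAM_JOBMANAGER_CONNECTION_FAILURE;
            }
        };
        System.out.println(waitForCancellation(job, 10_000));
    }
}
```

Only after waitForCancellation returns true do we delete the job's working directory; on a timeout we leave it in place rather than risk the stale-NFS situation below.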

Cheers,

Roelof Kemp

Stuart Martin wrote:
Hi Roelof,

I think the root issue here is that GRAM will cancel the SGE job and that may take some time for the job to actually be cancelled. Currently, GRAM does not monitor a job after cancellation. GRAM does not wait until the job is no longer visible / leaves the LRM queue. GRAM essentially triggers the cancellation in the LRM.

So there could be a timing issue here for cleaning up a job's files if the LRM job cancellation is slow.

You could try modifying the sge.pm cancel() subroutine to loop and periodically check whether the job has been removed before returning. The GramJob.cancel() call will block waiting for a reply, so delaying the return of the remote cancel means that once the call completes you know for sure the job has left the SGE queue.
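The sge.pm change suggested here boils down to a poll-until-gone loop. Sketched below in Java rather than Perl so it runs standalone: the stillQueued supplier stands in for a qstat lookup of the job id (an assumption for illustration; the real sge.pm cancel() would shell out to SGE).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class CancelWait {
    // After triggering the cancel, keep checking the LRM queue until the
    // job is no longer reported, or give up after maxPolls checks.
    static boolean waitUntilGone(BooleanSupplier stillQueued, int maxPolls)
            throws InterruptedException {
        for (int i = 0; i < maxPolls; i++) {
            if (!stillQueued.getAsBoolean()) {
                return true;  // job has left the queue; cancel() may now return
            }
            TimeUnit.MILLISECONDS.sleep(200); // back off before re-checking
        }
        return false;  // still visible after maxPolls checks
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a job that disappears from the queue after three checks.
        int[] checks = {0};
        System.out.println(waitUntilGone(() -> ++checks[0] < 3, 20));
    }
}
```

With a loop like this inside cancel(), the blocking GramJob.cancel() call on the client side only returns once the job has really left the SGE queue, so deleting the working directory right after it is safe.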

An enhancement for a more reliable cancel operation across all LRMs probably makes sense. I'm surprised this has not come up before. Maybe LRMs typically cancel jobs quickly enough.

-Stu

On Mar 7, 2008, at 8:41 AM, Roelof Kemp wrote:

Hi all,

I have a situation in which I submit a GramJob to an SGE jobmanager and run it in a specific directory (which I create just before submitting the job). Sometimes I want to cancel the job, so I call the method GramJob.cancel(). Immediately after this call I remove the directory the job ran in. Sometimes this leads to a situation where the directory is indeed deleted, but the job keeps running. The log of the SGE jobmanager tells me this:

03/06/2008 17:23:28|qmaster|fs0|W|job 176632.1 failed on host xxx general opening input/output file because: 03/06/2008 17:23:28 [1001:23735]: error: can't open output file "xxx/17249.1204820556/stdout": Stale NFS file handle

The stale NFS file handle is probably the reason the job isn't properly cancelled, and I understand that the cancellation takes some time and that I have to wait for it to finish before deleting the directory. Is there any way to know when GramJob.cancel() is done? Can I catch a status change in the handleStatusChange method (I do implement GramJobListener)? Which status indicates a successful cancellation? Or can I somehow poll to find out whether the cancellation is done?

Can anyone help me with this?

Thanks in advance,

Roelof

