OK, thanks for your suggestion. In addition to this server-side solution
(fixing the jobmanager), we've also created a client-side solution. We
poll the job state until the following holds:
gramJob.getError() == GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE
Then we remove the directory in which the job ran. This works
perfectly, because at that moment the cancellation is completely done. So
if anyone runs into this problem again, there's a way out of it ;-)
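For anyone wanting to replicate this, a minimal sketch of that polling loop follows. To keep it self-contained, the Globus-specific check (gramJob.getError() against GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE) is abstracted behind an IntSupplier; the constant 79 in the demo is only a placeholder for the real error code, and the helper name waitForError, the poll interval, and the attempt limit are all my own assumptions, not part of the CoG API.

```java
import java.util.function.IntSupplier;

public class CancelPoller {
    /**
     * Polls errorSupplier until it returns targetError or maxAttempts
     * runs out. In the real client, errorSupplier would wrap
     * gramJob.getError() and targetError would be
     * GramError.GRAM_JOBMANAGER_CONNECTION_FAILURE.
     * Returns true if the target error was seen (safe to delete the
     * job's working directory), false if we gave up.
     */
    public static boolean waitForError(IntSupplier errorSupplier, int targetError,
                                       int maxAttempts, long sleepMs) {
        for (int i = 0; i < maxAttempts; i++) {
            if (errorSupplier.getAsInt() == targetError) {
                return true;  // cancellation fully done
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;  // do NOT delete the directory yet
    }

    public static void main(String[] args) {
        // Simulated job: reports error code 0 twice, then 79
        // (79 is just a stand-in value for the demo).
        int[] calls = {0};
        IntSupplier fakeJob = () -> (++calls[0] < 3) ? 0 : 79;
        boolean done = waitForError(fakeJob, 79, 10, 1);
        System.out.println(done);  // prints "true"
        // Only now would the real client delete the job's run directory.
    }
}
```

The point of the pattern is simply that the directory removal happens after the poll loop confirms the terminal error code, never directly after GramJob.cancel() returns.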
Cheers,
Roelof Kemp
Stuart Martin wrote:
Hi Roelof,
I think the root issue here is that GRAM will cancel the SGE job and
that may take some time for the job to actually be cancelled.
Currently, GRAM does not monitor a job after cancellation. GRAM does
not wait until the job is no longer visible / leaves the LRM queue.
GRAM essentially triggers the cancellation in the LRM.
So there could be a timing issue here when cleaning up a job's files if
the LRM's job cancellation is slow.
You could try modifying the sge.pm cancel() subroutine to loop and
occasionally check if the job has been removed before returning. The
GramJob.cancel() call will block waiting for a reply, so delaying the
remote cancel will result in knowing for sure that the job has left
the SGE queue once the call completes.
An enhancement for a more reliable cancel operation across all
LRMs probably makes sense. I'm surprised this has not come up
before; maybe LRMs typically cancel jobs quickly enough.
-Stu
On Mar 7, 2008, at 8:41 AM, Roelof Kemp wrote:
Hi all,
I have a situation in which I submit a GramJob to an SGE jobmanager
and run it in a specific directory (which I create just before
submitting the job). Sometimes I want to cancel the job, and to do so
I call the method GramJob.cancel(). Directly after this call I remove
the directory the job ran in. This sometimes leads to a
situation where the directory is indeed deleted, but the job keeps
running. The logging of the SGE jobmanager tells me this:
03/06/2008 17:23:28|qmaster|fs0|W|job 176632.1 failed on host xxx
general opening input/output file because: 03/06/2008 17:23:28
[1001:23735]: error: can't open output file
"xxx/17249.1204820556/stdout": Stale NFS file handle
The stale NFS file handle is probably the reason the job isn't
properly cancelled, and I understand that the cancellation takes some
time and that I have to wait for it before deleting the directory. Is
there any way to know when GramJob.cancel() has completed? Can I catch
a status change in the handleStatusChange method (I do implement
GramJobListener)? Which status indicates a successful cancellation?
Or can I somehow poll to find out whether the cancellation is done?
Can anyone help me with this?
Thanks in advance,
Roelof