Hi Matthias,

I can't speak for GridEngine's specific behavior because I haven't used it
in a long time, but it's not surprising that jobs "disappear" as soon as
they've exited. Unfortunately, Galaxy polls job status periodically rather
than waiting on completion. To wait instead, we'd need a thread per
submitted job, unless you can still get job exit details by looping over
jobs with a timeout wait.
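
To make the timeout-wait idea concrete, here is a rough, untested-on-a-cluster
sketch. It assumes the session object offers wait(job_id, timeout) as the
python drmaa library's Session does, and that the returned object carries the
fields Matthias mentions (hasExited, hasSignal, terminatedSignal) plus
wasAborted and exitStatus. The JobInfo tuple below is only a local stand-in
with those field names, and classify() and reap_finished() are hypothetical
helpers, not anything that exists in Galaxy.

```python
from collections import namedtuple

# Local stand-in mirroring the field names of the drmaa library's wait()
# result (an assumption for illustration; a real session returns its own type).
JobInfo = namedtuple(
    "JobInfo",
    "jobId hasExited hasSignal terminatedSignal hasCoreDump "
    "wasAborted exitStatus resourceUsage",
)


def classify(info):
    """Map a wait() result to a coarse outcome string (hypothetical helper)."""
    if info.wasAborted:
        return "aborted"
    if info.hasSignal:
        # e.g. killed by the scheduler for exceeding run time or memory limits
        return "killed by signal %s" % info.terminatedSignal
    if info.hasExited:
        return "ok" if info.exitStatus == 0 else "failed"
    return "unknown"


def reap_finished(session, job_ids, timeout_exc, timeout=1):
    """Loop over jobs with a short timeout wait instead of one thread per job.

    timeout_exc is the exception the session raises when the wait times out
    (with the drmaa library this would be drmaa.errors.ExitTimeoutException).
    Returns {job_id: outcome} for the jobs that have finished.
    """
    finished = {}
    for job_id in list(job_ids):
        try:
            info = session.wait(job_id, timeout)
        except timeout_exc:
            continue  # still running; check again on the next pass
        finished[job_id] = classify(info)
    return finished
```

With a real session this would be called roughly as
reap_finished(s, job_ids, drmaa.errors.ExitTimeoutException). Whether
GridEngine still returns exit details to a second wait on the same job is
exactly the open question, so treat this as something to try rather than a
fix.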

You can gain some control over how Galaxy handles InvalidJobException
with the drmaa job runner plugin params; see here:

https://github.com/galaxyproject/galaxy/blob/dev/config/job_conf.xml.sample_advanced#L9

However, if normally finished jobs also result in InvalidJobException, that
probably won't help. Alternatively, you could create a DRMAAJobRunner
subclass for GridEngine like we've done for Slurm that does some digging to
learn more about terminal jobs.

--nate

On Thu, Jun 15, 2017 at 10:27 AM, Matthias Bernt <m.be...@ufz.de> wrote:

> Dear list,
>
> I have two questions for all DRMAA users. Here is the first one.
>
> I was checking how our queuing system (Univa GridEngine) and Galaxy react
> if jobs are submitted that exceed run time or memory limits.
>
> I found out that the python drmaa library cannot query the job status
> after the job is finished (for both successful and unsuccessful jobs).
>
> In lib/galaxy/jobs/runners/drmaa.py the call gives an exception
>     self.ds.job_status( external_job_id )
>
> Is this always the case? Or might this be a problem with our GridEngine?
>
> I have attached some code for testing. There, the first call to
> s.jobStatus(jobid) works, but the second one, after s.wait(...), fails
> with "drmaa.errors.InvalidJobException: code 18: The job specified
> by the 'jobid' does not exist."
>
> The same error pops up in the Galaxy logs. The consequence is that jobs
> that reached the limits are shown as completed successfully in Galaxy.
>
> Interestingly, quite a bit of information can be obtained from the return
> value of s.wait. I was wondering if this can be used to differentiate
> successful from failed jobs. In particular hasExited, hasSignal, and
> terminatedSignal are different in the two cases.
>
> Cheers,
> Matthias
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/
>