Matthias, 

We have had this problem on our SGE-based installation for years. We referred 
to it as the "green screen of death", because it would allow a biologist to 
continue analysis using output that was partial at best. This often resulted in 
seemingly successful completion of the entire analysis but completely bogus 
results (say, cuffdiff was killed halfway through the genome but is green in 
Galaxy, so there are no transcripts on the smaller chromosomes, and no error 
either).  

We ended up implementing an external reaper that detected these killed jobs 
in SGE and notified the user and Galaxy post hoc. It was not a very 
satisfactory solution. We are currently moving to SLURM for other reasons and 
hope the problem will not be present there. 

Regards, 
Curtis


-----Original Message-----
From: galaxy-dev [mailto:galaxy-dev-boun...@lists.galaxyproject.org] On Behalf 
Of Matthias Bernt
Sent: Thursday, June 15, 2017 9:27 AM
To: galaxy-dev@lists.galaxyproject.org
Subject: [galaxy-dev] drmaa job status

Dear list,

I have two questions for all DRMAA users. Here is the first one.

I was checking how our queuing system (Univa Grid Engine) and Galaxy react when 
jobs are submitted that exceed run time or memory limits.

I found out that the Python drmaa library cannot query the job status after the 
job has finished (for both successful and unsuccessful jobs).

In lib/galaxy/jobs/runners/drmaa.py the following call raises an exception:
     self.ds.job_status( external_job_id )

Is this always the case? Or might this be a problem with our GridEngine?

I have attached some code for testing. There, the first call to
s.jobStatus(jobid) works, but the second one, after s.wait(...), fails with
"drmaa.errors.InvalidJobException: code 18: The job specified by the 
'jobid' does not exist."

The same error pops up in the Galaxy logs. The consequence is that jobs that 
reached the limits are shown as completed successfully in Galaxy.

Interestingly, quite a bit of information can be obtained from the return value 
of s.wait. I was wondering if this can be used to differentiate successful from 
failed jobs. In particular hasExited, hasSignal, and terminatedSignal are 
different in the two cases.
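
As a rough illustration of that idea, here is a minimal sketch of how the 
fields of the s.wait return value might be mapped to an outcome. The 
classify_job helper and the JobInfo stand-in are hypothetical; the stand-in 
only mirrors the field names of the tuple that the Python drmaa library 
returns, so the sketch runs without a cluster. Which values a given queuing 
system actually reports for a job killed at a limit is an assumption that 
would need checking on each installation.

```python
from collections import namedtuple

# Stand-in with the same field names as the tuple returned by the Python
# drmaa library's Session.wait(); defined locally so no DRM is needed.
JobInfo = namedtuple(
    "JobInfo",
    "jobId hasExited hasSignal terminatedSignal hasCoreDump "
    "wasAborted exitStatus resourceUsage",
)


def classify_job(info):
    """Hypothetical helper: map wait() fields to a coarse outcome string."""
    if info.wasAborted:
        # The job never ran to completion (e.g. removed before it started).
        return "aborted"
    if info.hasSignal:
        # Assumption: a job killed for exceeding a limit shows up here.
        return "killed by signal %s" % info.terminatedSignal
    if info.hasExited:
        # Normal termination; the exit code decides success or failure.
        return "ok" if info.exitStatus == 0 else "failed"
    return "unknown"
```

With a real session one would pass the result of s.wait(jobid, ...) to 
classify_job instead of building the tuple by hand.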

Cheers,
Matthias

