Hey Daniel,

Thanks so much for the detailed problem report, it was very helpful.
Reviewing the code, there does appear to be a bug in the PBS job runner:
in some code paths pbs_job_state.stop_job is never set, but fail_job
later tries to read it. I don't have TORQUE available, so I don't have a
great test setup for this problem - any chance you could make the
following changes for me and let me know if they work?
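
Just to illustrate why the traceback looks the way it does, here is a
minimal standalone sketch (not Galaxy code - the class below is only a
stand-in for the real AsynchronousJobState):

    class AsynchronousJobState( object ):
        # Stand-in for Galaxy's job state object, for illustration only.
        pass

    pbs_job_state = AsynchronousJobState()
    # Nothing ever sets stop_job on this code path, so reading it fails
    # exactly like the error you posted:
    try:
        if pbs_job_state.stop_job:
            pass
    except AttributeError as e:
        print( e )  # 'AsynchronousJobState' object has no attribute 'stop_job'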

Between the following two lines:

                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
                    self.work_queue.put( ( self.fail_job, pbs_job_state ) )

can you add pbs_job_state.stop_job = False, so that they become:

                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
                    pbs_job_state.stop_job = False
                    self.work_queue.put( ( self.fail_job, pbs_job_state ) )

And at the top of the file, can you add a -11 entry to the
JOB_EXIT_STATUS dictionary to indicate that the job exceeded its
walltime?
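
Concretely, the tail of that dictionary would look something like this
(the exact wording of the message is just a suggestion; it matches the
attached patch):

        -7: "job restart failed",
        -8: "exec() of user command failed",
        -11: "job maximum walltime exceeded",
    }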

I have attached a patch that should apply cleanly against the latest
stable - it will probably work against your branch as well.
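
In case it helps, the patch can be applied without committing from your
galaxy-dist directory with something like the following (the filename is
just whatever you save the attachment as):

    cd galaxy-dist
    hg import --no-commit pbs_stop_job.patch

Using --no-commit leaves the changes in your working directory, so they
are easy to back out later with hg revert --all.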

If you would rather not act as my QC layer, I can try to come up with
a way to do some testing on my end :).

Thanks again,
-John


On Mon, Nov 4, 2013 at 10:10 AM, Daniel Patrick Sullivan
<dansulli...@gmail.com> wrote:
> Hi, Galaxy Developers,
>
> I have what I hope is somewhat of a basic question regarding Galaxy's
> interaction with a PBS job cluster and information reported via the web UI.
> Basically, in certain situations, the walltime of a specific job is
> exceeded.  This is of course to be expected and all fine and
> understandable.
>
> My problem is that the information is not being relayed back to the end user
> via the Galaxy web UI, which causes confusion in our Galaxy user community.
> Basically the Torque scheduler generates the following message when a
> walltime is exceeded:
>
> 11/04/2013
> 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing to
> send 'a' mail for job 163.sctest.cri.uchicago.edu to
> s.cri.gal...@crigalaxy-test.uchicago.edu (Job exceeded its walltime limit.
> Job was aborted
> 11/04/2013
> 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job exit
> status -11 handled
>
> Now, my problem is that this status -11 return code is not being correctly
> handled by Galaxy.  What happens is that Galaxy throws an exception,
> specifically:
>
> 10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET
> /api/histories/90240358ebde1489 HTTP/1.1" 200 -
> "https://crigalaxy-test.uchicago.edu/history"; "Mozilla/5.0 (X11; Linux
> x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
> galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137
> (2150/163.sctest.cri.uchicago.edu) PBS job state changed from R to C
> galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139
> (2150/163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
> galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled
> exception calling fail_job
> Traceback (most recent call last):
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 60, in run_next
>     method(arg)
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
> 561, in fail_job
>     if pbs_job_state.stop_job:
> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>
> After this exception occurs, the Galaxy job status via the Web UI is still
> reported as "Job is currently running".  It appears that the job will remain
> in this state (from the end user's perspective) indefinitely.  Has anybody
> seen this issue before?
>
> I noticed that return code -11 does not exist in
> /group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py under the
> JOB_EXIT_STATUS dictionary.  I tried adding an entry for this; however, when
> I do, the exception changes to:
>
> galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274
> (2151/164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
> galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled
> exception calling fail_job
> Traceback (most recent call last):
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
> line 60, in run_next
>     method(arg)
>   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py", line
> 562, in fail_job
>     if pbs_job_state.stop_job:
> AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
>
> I am wondering if this is a bug or if it is just because I am using a newer
> version of TORQUE (I am using TORQUE 4.2.2).
>
> In terms of Galaxy, I am using:
>
> [s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
> changeset:   10408:6822f41bc9bb
> branch:      stable
> parent:      10393:d05bf67aefa6
> user:        Dave Bouvier <d...@bx.psu.edu>
> date:        Mon Aug 19 13:06:17 2013 -0400
> summary:     Fix for case where running functional tests might overwrite
> certain files in database/files.
>
> [s.cri.galaxy@crigalaxy-test galaxy-dist]$
>
> Does anybody know how I could fix this such that walltime-exceeded messages
> are correctly reported via the Galaxy web UI for TORQUE 4.2.2?  Thank you
> so much for your input and guidance, and for the ongoing development of
> Galaxy.
>
> Dan Sullivan
>
# HG changeset patch
# User John Chilton <jmchil...@gmail.com>
# Date 1383662656 21600
#      Tue Nov 05 08:44:16 2013 -0600
# Branch stable
# Node ID ad0f8741dce5ad23e3d53c0587d57e33baa5f31e
# Parent  5c789ab4144ac9db6c91b5646032894cae016309
Fix stop_job bug in pbs runner and handle exit code -11.

diff -r 5c789ab4144a -r ad0f8741dce5 lib/galaxy/jobs/runners/pbs.py
--- a/lib/galaxy/jobs/runners/pbs.py	Mon Nov 04 15:04:42 2013 -0500
+++ b/lib/galaxy/jobs/runners/pbs.py	Tue Nov 05 08:44:16 2013 -0600
@@ -80,6 +80,7 @@
     -6: "job aborted on MOM init, chkpt, ok migrate",
     -7: "job restart failed",
     -8: "exec() of user command failed",
+    -11: "job maximum walltime exceeded",
 }
 
 
@@ -414,6 +415,7 @@
                 except AssertionError:
                     pbs_job_state.fail_message = 'Job cannot be completed due to a cluster error, please retry it later'
                     log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
+                    pbs_job_state.stop_job = False
                     self.work_queue.put( ( self.fail_job, pbs_job_state ) )
                     continue
                 except AttributeError: