Hi, Galaxy Developers,

I have what I hope is a fairly basic question about Galaxy's interaction
with a PBS job cluster and the information reported via the web UI.
In certain situations, a job exceeds its walltime. This is of course to
be expected and perfectly understandable.

My problem is that this information is not being relayed back to the end
user via the Galaxy web UI, which causes confusion in our Galaxy user
community. The Torque scheduler generates the following message when a
walltime is exceeded:

11/04/2013 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing
to send 'a' mail for job 163.sctest.cri.uchicago.edu to
s.cri.gal...@crigalaxy-test.uchicago.edu (Job exceeded its walltime limit.
Job was aborted
11/04/2013 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job
exit status -11 handled

Now, my problem is that this exit status of -11 is not being correctly
handled by Galaxy. What happens is that Galaxy throws an exception,
specifically:

10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET
/api/histories/90240358ebde1489 HTTP/1.1" 200 - "
https://crigalaxy-test.uchicago.edu/history" "Mozilla/5.0 (X11; Linux
x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137 (2150/
163.sctest.cri.uchicago.edu) PBS job state changed from R to C
galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139 (2150/
163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled
exception calling fail_job
Traceback (most recent call last):
  File
"/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line
60, in run_next
    method(arg)
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
line 561, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'

After this exception occurs, the Galaxy web UI still reports the job status
as "Job is currently running". It appears that the job will remain in this
state (from the end user's perspective) indefinitely. Has anybody seen this
issue before?

I noticed that return code -11 does not exist in the JOB_EXIT_STATUS
dictionary in
/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py. I tried
adding an entry for it; however, when I do, the exception changes to:

galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274 (2151/
164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled
exception calling fail_job
Traceback (most recent call last):
  File
"/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line
60, in run_next
    method(arg)
  File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
line 562, in fail_job
    if pbs_job_state.stop_job:
AttributeError: 'AsynchronousJobState' object has no attribute 'stop_job'
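
To make the dictionary change concrete, here is a minimal, self-contained
sketch of the kind of lookup involved. It is not the actual Galaxy source:
the helper name describe_exit_status is hypothetical, and the table holds
only the one entry whose message appears in the second log excerpt. The
"Unknown error" fallback mirrors the wording in the first log excerpt.

```python
# Hypothetical stand-in for the JOB_EXIT_STATUS table in pbs.py. Only the
# -11 entry and its message are taken from the log output above; the real
# table contains other entries that are not reproduced here.
JOB_EXIT_STATUS = {
    -11: "job walltime exceeded",
}


def describe_exit_status(exit_status):
    """Return a readable message for a PBS exit status (illustrative helper)."""
    # Fall back to a generic message when the status is not in the table.
    return JOB_EXIT_STATUS.get(exit_status, "Unknown error: %s" % exit_status)


if __name__ == "__main__":
    print("PBS job failed: %s" % describe_exit_status(-11))
    print("PBS job failed: %s" % describe_exit_status(-99))
```

As the second traceback shows, a change of this kind only improves the log
message; the AttributeError in fail_job is raised either way.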

I am wondering if this is a bug or if it is just because I am using a newer
version of TORQUE (I am using TORQUE 4.2.2).
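
The AttributeError itself suggests that the state object handed to fail_job
simply has no stop_job attribute. Below is a minimal sketch of a defensive
guard, assuming that is the case. The class is a bare stand-in for Galaxy's
AsynchronousJobState, the helper should_stop_job is hypothetical, and the
default value is a guess; which default is correct depends on what Galaxy
intends stop_job to mean, so please treat this as a possible workaround and
not as the proper fix.

```python
class AsynchronousJobState(object):
    """Bare stand-in for Galaxy's class; assumed to lack a stop_job attribute."""

    def __init__(self, job_id):
        self.job_id = job_id


def should_stop_job(job_state, default=True):
    """Read stop_job if the state object has it, otherwise use the default.

    Hypothetical helper: getattr with a default avoids the AttributeError
    seen in the traceback when the attribute is missing.
    """
    return getattr(job_state, "stop_job", default)


if __name__ == "__main__":
    state = AsynchronousJobState("163.sctest.cri.uchicago.edu")
    # No stop_job attribute is set, so the default is used.
    print(should_stop_job(state))
    # Once the attribute exists, its value takes precedence.
    state.stop_job = False
    print(should_stop_job(state))
```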

In terms of Galaxy, I am using:

[s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
changeset:   10408:6822f41bc9bb
branch:      stable
parent:      10393:d05bf67aefa6
user:        Dave Bouvier <d...@bx.psu.edu>
date:        Mon Aug 19 13:06:17 2013 -0400
summary:     Fix for case where running functional tests might overwrite
certain files in database/files.

[s.cri.galaxy@crigalaxy-test galaxy-dist]$

Does anybody know how I could fix this so that walltime-exceeded messages
are correctly reported via the Galaxy web UI for TORQUE 4.2.2?  Thank you
so much for your input and guidance, and for the ongoing development of
Galaxy.

Dan Sullivan