On Tue, Nov 5, 2013 at 11:53 AM, Daniel Patrick Sullivan
<dansulli...@gmail.com> wrote:
> Hi, John,
>
> Thank you for taking the time to help me look into this issue.  I have
> applied the patch you provided and confirmed that it appears to help
> remediate the problem (when a walltime is exceeded feedback is in fact
> provided via the Galaxy web UI; it no longer appears that jobs are running
> indefinitely).    One thing I would like to note is that the error that is
> provided to the user is generic, i.e. the web UI reports "An error occurred
> with this dataset: Job cannot be completed due to a cluster error, please
> retry it later".  So, the fact that a walltime-exceeded error actually
> occurred is not presented to the user (I am not sure whether this is intentional
> or not).  Again, I appreciate you taking the time to verify and patch this
> issue.  I have attached a screenshot of the output for your review.

Glad we are making progress - I have committed that previous patch to
galaxy-central. Let's see if we can't improve the user feedback so
they know they hit the maximum walltime. Can you try this new patch?
The message about the timeout was being built, but it was neither
logged nor set as the error message on the dataset - this patch should
resolve that.
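In case it helps anyone reading the archive later, the shape of the change is roughly this (a sketch with hypothetical stand-ins, not the exact committed code; names follow lib/galaxy/jobs/runners/pbs.py):

```python
# Sketch of the failure-handling pattern in the patch (hypothetical stand-ins;
# names mirror lib/galaxy/jobs/runners/pbs.py but this is not the committed code).
JOB_EXIT_STATUS = {
    -11: "job maximum walltime exceeded",  # Torque reports -11 on a walltime abort
}

CLUSTER_ERROR_MESSAGE = "Job cannot be completed due to a cluster error, please retry it later: %s"

def build_fail_message(exit_status):
    # Look up a human-readable reason, falling back to the raw exit code.
    error_message = JOB_EXIT_STATUS.get(exit_status, "Unknown error: %s" % exit_status)
    # The same text is both logged and set as the dataset's fail message,
    # so the walltime reason reaches the web UI instead of a generic error.
    return CLUSTER_ERROR_MESSAGE % error_message

print(build_fail_message(-11))
```

The point is simply that the lookup result is reused for both the log line and the user-visible message, rather than being built and then dropped.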

>
> I am probably going to be testing Galaxy with Torque 4.2.5 in the coming
> weeks; I will let you know if I identify any additional problems.  Thank
> you so much, and have a wonderful day.

You too, thanks for working with me on fixing this!

-John

>
> Dan Sullivan
>
>
> On Tue, Nov 5, 2013 at 8:48 AM, John Chilton <chil...@msi.umn.edu> wrote:
>>
>> Hey Daniel,
>>
>> Thanks so much for the detailed problem report, it was very helpful.
>> Reviewing the code, there appears to be a bug in the PBS job runner -
>> in some cases pbs_job_state.stop_job is never set but is later read.
>> I don't have Torque, so I don't have a great test setup for this
>> problem - any chance you can make the following changes for me and
>> let me know if they work?
>>
>> Between the following two lines:
>>
>>     log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
>>     self.work_queue.put( ( self.fail_job, pbs_job_state ) )
>>
>> add a line setting stop_job, so the block reads:
>>
>>     log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
>>     pbs_job_state.stop_job = False
>>     self.work_queue.put( ( self.fail_job, pbs_job_state ) )
>>
>> And at the top of the file, can you add a -11 entry to the
>> JOB_EXIT_STATUS dictionary to indicate a job timeout?
>>
>> I have attached a patch that would apply against the latest stable -
>> it will probably work against your branch as well.
>>
>> If you would rather not act as my QC layer, I can try to come up with
>> a way to do some testing on my end :).
>>
>> Thanks again,
>> -John
>>
>>
>> On Mon, Nov 4, 2013 at 10:10 AM, Daniel Patrick Sullivan
>> <dansulli...@gmail.com> wrote:
>> > Hi, Galaxy Developers,
>> >
>> > I have what I hope is a somewhat basic question regarding Galaxy's
>> > interaction with a PBS job cluster and information reported via the
>> > web UI.
>> > Basically, in certain situations, the walltime of a specific job is
>> > exceeded.  This is of course to be expected and all fine and
>> > understandable.
>> >
>> > My problem is that the information is not being relayed back to the end
>> > user
>> > via the Galaxy web UI, which causes confusion in our Galaxy user
>> > community.
>> > Basically the Torque scheduler generates the following message when a
>> > walltime is exceeded:
>> >
>> > 11/04/2013
>> > 08:39:45;000d;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;preparing
>> > to
>> > send 'a' mail for job 163.sctest.cri.uchicago.edu to
>> > s.cri.gal...@crigalaxy-test.uchicago.edu (Job exceeded its walltime
>> > limit.
>> > Job was aborted
>> > 11/04/2013
>> > 08:39:45;0009;PBS_Server.30621;Job;163.sctest.cri.uchicago.edu;job exit
>> > status -11 handled
>> >
>> > Now, my problem is that this status -11 return code is not being
>> > correctly handled by Galaxy.  What happens is that Galaxy throws an
>> > exception, specifically:
>> >
>> > 10.135.217.178 - - [04/Nov/2013:08:39:42 -0500] "GET
>> > /api/histories/90240358ebde1489 HTTP/1.1" 200 -
>> > "https://crigalaxy-test.uchicago.edu/history"; "Mozilla/5.0 (X11; Linux
>> > x86_64; rv:23.0) Gecko/20100101 Firefox/23.0"
>> > galaxy.jobs.runners.pbs DEBUG 2013-11-04 08:39:46,137
>> > (2150/163.sctest.cri.uchicago.edu) PBS job state changed from R to C
>> > galaxy.jobs.runners.pbs ERROR 2013-11-04 08:39:46,139
>> > (2150/163.sctest.cri.uchicago.edu) PBS job failed: Unknown error: -11
>> > galaxy.jobs.runners ERROR 2013-11-04 08:39:46,139 (unknown) Unhandled
>> > exception calling fail_job
>> > Traceback (most recent call last):
>> >   File
>> > "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
>> > line 60, in run_next
>> >     method(arg)
>> >   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>> > line
>> > 561, in fail_job
>> >     if pbs_job_state.stop_job:
>> > AttributeError: 'AsynchronousJobState' object has no attribute
>> > 'stop_job'
>> >
>> > After this exception occurs, the Galaxy job status via the Web UI is
>> > still
>> > reported as "Job is currently running".  It appears that the job will
>> > remain
>> > in this state (from the end users perspective) indefinitely.  Has
>> > anybody
>> > seen this issue before?
>> >
>> > I noticed that return code -11 does not exist in
>> > /group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py under the
>> > JOB_EXIT_STATUS dictionary.  I tried adding an entry for this; however,
>> > when I do, the exception changes to:
>> >
>> > galaxy.jobs.runners.pbs ERROR 2013-11-04 10:02:17,274
>> > (2151/164.sctest.cri.uchicago.edu) PBS job failed: job walltime exceeded
>> > galaxy.jobs.runners ERROR 2013-11-04 10:02:17,275 (unknown) Unhandled
>> > exception calling fail_job
>> > Traceback (most recent call last):
>> >   File
>> > "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/__init__.py",
>> > line 60, in run_next
>> >     method(arg)
>> >   File "/group/galaxy_test/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>> > line
>> > 562, in fail_job
>> >     if pbs_job_state.stop_job:
>> > AttributeError: 'AsynchronousJobState' object has no attribute
>> > 'stop_job'
>> >
>> > I am wondering if this is a bug or if it is just because I am using a
>> > newer
>> > version of TORQUE (I am using TORQUE 4.2.2).
>> >
>> > In terms of Galaxy, I am using:
>> >
>> > [s.cri.galaxy@crigalaxy-test galaxy-dist]$ hg parents
>> > changeset:   10408:6822f41bc9bb
>> > branch:      stable
>> > parent:      10393:d05bf67aefa6
>> > user:        Dave Bouvier <d...@bx.psu.edu>
>> > date:        Mon Aug 19 13:06:17 2013 -0400
>> > summary:     Fix for case where running functional tests might overwrite
>> > certain files in database/files.
>> >
>> > [s.cri.galaxy@crigalaxy-test galaxy-dist]$
>> >
>> > Does anybody know how I could fix this such that walltime-exceeded
>> > messages are correctly reported via the Galaxy web UI for TORQUE 4.2.2?
>> > Thank you so much for your input and guidance, and for the ongoing
>> > development of Galaxy.
>> >
>> > Dan Sullivan
>> >
>> > ___________________________________________________________
>> > Please keep all replies on the list by using "reply all"
>> > in your mail client.  To manage your subscriptions to this
>> > and other Galaxy lists, please use the interface at:
>> >   http://lists.bx.psu.edu/
>> >
>> > To search Galaxy mailing lists use the unified search at:
>> >   http://galaxyproject.org/search/mailinglists/
>
# HG changeset patch
# User John Chilton <jmchil...@gmail.com>
# Date 1383706041 21600
#      Tue Nov 05 20:47:21 2013 -0600
# Node ID 186104d8c3c10b8b023b301be375acd1371967a6
# Parent  e290746008f2951df07655df8086015fa3e0cf3f
Attempt to improve user feedback for pbs jobs which hit max walltime.

Working with Dan Sullivan on this.

diff -r e290746008f2 -r 186104d8c3c1 lib/galaxy/jobs/runners/pbs.py
--- a/lib/galaxy/jobs/runners/pbs.py	Tue Nov 05 08:44:16 2013 -0600
+++ b/lib/galaxy/jobs/runners/pbs.py	Tue Nov 05 20:47:21 2013 -0600
@@ -34,6 +34,8 @@
 
 __all__ = [ 'PBSJobRunner' ]
 
+CLUSTER_ERROR_MESSAGE = "Job cannot be completed due to a cluster error, please retry it later: %s"
+
 # The last two lines execute the command and then retrieve the command's
 # exit code ($?) and write it to a file.
 pbs_symlink_template = """
@@ -80,7 +82,7 @@
     -6: "job aborted on MOM init, chkpt, ok migrate",
     -7: "job restart failed",
     -8: "exec() of user command failed",
-    -11: "job maximum walltime exceeded",
+    -11: "job maximum walltime exceeded",  # Added by John, not from job.h.
 }
 
 
@@ -413,8 +415,10 @@
                     assert int( status.exit_status ) == 0
                     log.debug("(%s/%s) PBS job has completed successfully" % ( galaxy_job_id, job_id ) )
                 except AssertionError:
-                    pbs_job_state.fail_message = 'Job cannot be completed due to a cluster error, please retry it later'
-                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, JOB_EXIT_STATUS.get( int( status.exit_status ), 'Unknown error: %s' % status.exit_status ) ) )
+                    exit_status = int( status.exit_status )
+                    error_message = JOB_EXIT_STATUS.get( exit_status, 'Unknown error: %s' % status.exit_status )
+                    pbs_job_state.fail_message = CLUSTER_ERROR_MESSAGE % error_message
+                    log.error( '(%s/%s) PBS job failed: %s' % ( galaxy_job_id, job_id, error_message ) )
                     pbs_job_state.stop_job = False
                     self.work_queue.put( ( self.fail_job, pbs_job_state ) )
                     continue
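For completeness, the AttributeError earlier in the thread can be reproduced in a few lines, and the one-line guard in the patch is what prevents it (simplified stand-ins, not the real runner classes):

```python
# Minimal reproduction of the AttributeError from the thread: fail_job reads
# stop_job unconditionally, but one code path never set it on the job state.
class AsynchronousJobState:
    """Simplified stand-in for the real class in lib/galaxy/jobs/runners."""
    pass

def fail_job(pbs_job_state):
    # Direct attribute access raises AttributeError if the flag was never set.
    if pbs_job_state.stop_job:
        pass  # the real runner would stop the PBS job here

state = AsynchronousJobState()
try:
    fail_job(state)
except AttributeError as err:
    print("crash: %s" % err)

# The fix: always set the flag before queueing the state for fail_job.
state.stop_job = False
fail_job(state)  # no longer raises
```

This is why the patch sets pbs_job_state.stop_job = False on the failure path before putting the state on the work queue.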