We've done a lot of work in Galaxy dev on this problem over the last
few years - I'm not sure how much concrete progress we have made.

Nate started it and I did some work at the end of last year. To
summarize my most recent work on this: in
https://github.com/galaxyproject/galaxy/pull/3291/commits/b78287f1508db2c06f0c309ed8d3747adb4d17fa
I added some test cases for the existing job runner resubmission code.
I mainly wrote them to understand what was already there, but hopefully
the examples in the form of test cases help you as well. This includes a
little test job_conf.xml file that shows how to catch walltime and
memory limit hits registered by the job runner and send the affected
jobs to different destinations. This requires that the job runner knows
how to record these problems. The SLURM job runner does; other job
runners, like the generic drmaa runner, may need to be subclassed to
check for these conditions.
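
To make the idea concrete, here is a minimal Python sketch of how such
rules can be read. It is not Galaxy's actual code: the helper
resubmit_destination is hypothetical, the job_conf fragment is trimmed
(no plugins section), and the destination ids and nativeSpecification
values are made up. The resubmit element and the condition names are
written as I remember them from the test job_conf.xml, so check them
against the file in the commit above.

```python
import xml.etree.ElementTree as ET

# Illustrative, trimmed job_conf.xml fragment. Destination ids and
# nativeSpecification values are invented for this sketch.
JOB_CONF = """
<job_conf>
    <destinations default="short">
        <destination id="short" runner="slurm">
            <param id="nativeSpecification">--time=00:30:00 --mem=4G</param>
            <resubmit condition="walltime_reached" destination="long" />
            <resubmit condition="memory_limit_reached" destination="bigmem" />
        </destination>
        <destination id="long" runner="slurm">
            <param id="nativeSpecification">--time=24:00:00 --mem=4G</param>
        </destination>
        <destination id="bigmem" runner="slurm">
            <param id="nativeSpecification">--time=00:30:00 --mem=64G</param>
        </destination>
    </destinations>
</job_conf>
"""


def resubmit_destination(job_conf_xml, destination_id, failure):
    """Hypothetical helper: return the id of the destination a job should
    be resubmitted to after `failure`, or None if no rule matches."""
    root = ET.fromstring(job_conf_xml)
    for destination in root.iter("destination"):
        if destination.get("id") != destination_id:
            continue
        # The first resubmit rule whose condition matches wins.
        for rule in destination.findall("resubmit"):
            if rule.get("condition") == failure:
                return rule.get("destination")
    return None


print(resubmit_destination(JOB_CONF, "short", "walltime_reached"))
```

A job that hits its walltime on "short" would move to "long" here, while
a job that hits the memory limit would move to "bigmem". The "long" and
"bigmem" destinations have no rules of their own, so a second failure
there is final.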

In
https://github.com/galaxyproject/galaxy/pull/3291/commits/7d52b28ab2ab0314cd4fa31108a6750cb9750ef3
I created a little DSL for resubmissions to make what can be expressed
in job_conf more powerful. Then I added variables to the expression
language such as seconds_since_queued,
seconds_running
(https://github.com/galaxyproject/galaxy/pull/3291/commits/18eb1c8d0e4c3f7616d44fd177c90943695b7053),
and the attempt number
(https://github.com/galaxyproject/galaxy/pull/3291/commits/7e338d790964f594ae67b33e6a72e1777e774b8c).
I also added the ability to resubmit on unknown job runner problems
(https://github.com/galaxyproject/galaxy/pull/3291/commits/0559cff6e94b250ddd98275b119ab51b36491e34).
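
As a rough illustration of what such an expression language looks like,
here is a minimal Python sketch of a restricted evaluator for boolean
conditions over job variables. It is not the implementation in the
commits above: evaluate_condition is a hypothetical function, and the
variable spellings attempt and unknown_error are my assumptions for
this example (seconds_running is the name mentioned above).

```python
import ast
import operator

# Comparison operators the sketch accepts; anything else is rejected.
_COMPARE = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def evaluate_condition(expression, variables):
    """Evaluate a small boolean expression against a dict of variables.

    Only names, numeric/boolean constants, comparisons and and/or/not
    are allowed, so arbitrary Python cannot be executed.
    """
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.Compare):
            left = visit(node.left)
            for op, comparator in zip(node.ops, node.comparators):
                compare = _COMPARE.get(type(op))
                if compare is None:
                    raise ValueError("unsupported comparison operator")
                right = visit(comparator)
                if not compare(left, right):
                    return False
                left = right
            return True
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (bool, int, float)):
            return node.value
        raise ValueError("unsupported expression element: %s" % type(node).__name__)

    return bool(visit(ast.parse(expression, mode="eval")))


# Example: retry unknown runner errors, but only for jobs that failed
# quickly and have not been attempted three times yet.
job_state = {"attempt": 1, "unknown_error": True, "seconds_running": 20}
print(evaluate_condition(
    "attempt < 3 and unknown_error and seconds_running < 150", job_state))
```

With a condition like this on a destination, a job that dies early for
an unexplained reason gets a couple of retries, while a job that ran
for a long time before failing is not resubmitted blindly.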

None of this is really documented outside the test cases - it is
waiting for someone to come along and find it useful.

I think the next thing I'd like to see for job resubmission, besides
documentation and support in more of the common job runners, is
described in this issue
(https://github.com/galaxyproject/galaxy/issues/3320). All the
existing resubmission logic is based on errors detected by job
runners. If the underlying error shows up as a tool failure instead,
we need a way to reason about that, and we currently cannot.

Hope this helps.

-John

On Thu, Jun 15, 2017 at 10:37 AM, Matthias Bernt <m.be...@ufz.de> wrote:
> Dear list,
>
> I was thinking about implementing the job resubmission feature for drmaa.
>
> I hope that I can simplify the job configuration for our installation (and
> probably others as well) by escalating through different queues (or
> resource limits). Thereby I hope to reduce the number of special cases that
> I need to take care of.
>
> I was wondering if there are others
>
> - who are also interested in this feature and want to join? I would try to
> give this project a head start in the next week.
>
> - that may have started to work on this feature or just started to think
> about it and want to share code/experience
>
> Best,
> Matthias
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>  https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>  http://galaxyproject.org/search/
