Re: [galaxy-dev] resubmission on out of memory

John Chilton Thu, 21 Sep 2017 06:52:30 -0700

Something like this is possible, with some caveats. Memory and walltime
errors can be detected - not by regex matching in tools, but by the job
runner. The SLURM runner implements detection of out-of-memory errors
and, I think, timeouts - I don't think most of the other runners do.

When I started hacking on this feature, there was no documentation for
it and I wanted to understand how it worked and verify that it worked
so I wrote a test case. The problem is the test case tests a bunch of
different features all at once - so it will be a lot to walk through
and you will need to understand dynamic job destinations and such:

https://github.com/galaxyproject/galaxy/blob/dev/test/integration/resubmission_job_conf.xml
https://github.com/galaxyproject/galaxy/commit/0559cff6e94b250ddd98275b119ab51b36491e34

That said let me see if I can come up with a simple example:

<job_conf>
  <plugins>
    <!-- set up a SLURM runner here, or update another runner to detect
         these conditions and set it up here -->
  </plugins>

  <destinations default="small_fast_host">
    <destination id="small_fast_host" runner="slurm">
      <param id="nativeSpecification">SHORT_WALLTIME_SMALL_MEMORY_OPTS_FOR_YOUR_CLUSTER</param>
      <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
      <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
      <resubmit condition="seconds_running &lt; 5 and attempt &lt; 3"
                delay="attempt * 1.5" destination="small_fast_host" />
    </destination>
    <destination id="longer_walltime_dest" runner="slurm">
      <param id="nativeSpecification">LONGER_WALLTIME_FOR_YOUR_CLUSTERS</param>
    </destination>
    <destination id="bigger_memory_dest" runner="slurm">
      <param id="nativeSpecification">BIGGER_MEMORY_FOR_YOUR_CLUSTERS</param>
    </destination>
  </destinations>

  <tools />
</job_conf>

Here you would fill in the native specifications for your various
destinations to redirect jobs as needed. Everything goes through an
initial destination (though you could parameterize this and have any
number of initial destinations). That destination will resubmit under
3 different conditions. If a walltime error is detected by the job
runner, it will resubmit to a destination that you have to configure
with a longer walltime (with id="longer_walltime_dest") - perhaps this
is a different cluster with longer wait times and correspondingly
longer walltimes. Likewise, if a memory error is detected, it will
resubmit to "bigger_memory_dest" (perhaps a special part of your
cluster with larger memory servers or a large shared memory machine).
Finally, to show off some coolness I added: if the job fails right away
(within the first 5 seconds), it will delay the job a bit and then
resubmit it, as long as the attempt count is still below 3. This may be
good for working around random cluster failures during submission if
things get busy.
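
As an illustration of how the placeholders above could be filled in for a
SLURM cluster, here is a minimal sketch. The `load` path and the
`<param id="nativeSpecification">` form follow Galaxy's sample job
configuration; the `--time` and `--mem` values (memory in MB) are made-up
numbers, not recommendations, and would have to match the limits and
partitions of your own cluster.

```xml
<job_conf>
  <plugins>
    <!-- the plugin id "slurm" is what runner="slurm" below refers to -->
    <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner" />
  </plugins>

  <destinations default="small_fast_host">
    <destination id="small_fast_host" runner="slurm">
      <!-- hypothetical limits: 30 minutes, 4 GB -->
      <param id="nativeSpecification">--time=00:30:00 --mem=4096</param>
      <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
      <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
    </destination>
    <destination id="longer_walltime_dest" runner="slurm">
      <!-- hypothetical limits: 24 hours, same memory -->
      <param id="nativeSpecification">--time=24:00:00 --mem=4096</param>
    </destination>
    <destination id="bigger_memory_dest" runner="slurm">
      <!-- hypothetical limits: same walltime, 32 GB -->
      <param id="nativeSpecification">--time=00:30:00 --mem=32768</param>
    </destination>
  </destinations>
</job_conf>
```

In this sketch the two fallback destinations have no resubmit rules of
their own, so a job that fails there is not retried again.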

The test case covers allowing users to supply parameters to assist
with finding destinations and controlling resubmission, as well as
dynamic destinations and how they may interact with these concepts.
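
For readers who have not used dynamic destinations: they are declared in
the same file and hand the choice of the real destination to a rule
function. A minimal declaration might look like the sketch below; the
names `dynamic_entry` and `pick_destination` are hypothetical, and the
rule function itself (a Python function placed with Galaxy's job rules)
is not shown here.

```xml
<destinations default="dynamic_entry">
  <!-- "dynamic_entry" and "pick_destination" are hypothetical names -->
  <destination id="dynamic_entry" runner="dynamic">
    <param id="type">python</param>
    <param id="function">pick_destination</param>
  </destination>
  <destination id="small_fast_host" runner="slurm">
    <!-- native specification and resubmit rules go here -->
  </destination>
</destinations>
```

The rule function would then return the id of a concrete destination such
as "small_fast_host", and any resubmit rules defined on that destination
apply from there on.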

Like you mentioned - it would be wonderful if tools could look at
their output and determine if memory problems were encountered - I
guess this is tracked here
(https://github.com/galaxyproject/galaxy/issues/3107). It is a medium
priority for me - so I may get to it at some point. This sort of thing
is important when scaling up analyses.

-John

On Tue, Sep 19, 2017 at 4:26 PM, Matthias Bernt <[email protected]> wrote:
> Dear list,
>
> I recall that it's possible to configure a tool such that out of memory
> conditions (and run time) can be recognized (by regexp matching on
> stdout/stderr). Can this be used to trigger job resubmission on the
> cluster?
>
> Could someone please point me to some kind of documentation, if this is the
> case?
>
> Best,
> Matthias
>
> --
>
> -------------------------------------------
> Matthias Bernt
> Bioinformatics Service
> Molekulare Systembiologie (MOLSYB)
> Helmholtz-Zentrum für Umweltforschung GmbH - UFZ/
> Helmholtz Centre for Environmental Research GmbH - UFZ
> Permoserstraße 15, 04318 Leipzig, Germany
> Phone +49 341 235 482296,
> [email protected], www.ufz.de
>
> Sitz der Gesellschaft/Registered Office: Leipzig
> Registergericht/Registration Office: Amtsgericht Leipzig
> Handelsregister Nr./Trade Register Nr.: B 4703
> Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: MinDirig
> Wilfried Kraus
> Wissenschaftlicher Geschäftsführer/Scientific Managing Director:
> Prof. Dr. Dr. h.c. Georg Teutsch
> Administrative Geschäftsführerin/ Administrative Managing Director:
> Prof. Dr. Heike Graßmann
> -------------------------------------------
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>  https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>  http://galaxyproject.org/search/