Re: [galaxy-dev] Galaxy not killing split cluster jobs

Peter Cock Thu, 03 May 2012 08:00:04 -0700

On Thu, May 3, 2012 at 3:54 PM, Dannon Baker <[email protected]> wrote:
>> On a related point, I've noticed sometimes one child job from a split task
>> can fail, yet the rest of the child jobs continue to run on the cluster
>> wasting CPU time. As soon as one child job dies (assuming there are no plans
>> for attempting a retry), I would like the parent task to kill all the other
>> children, and fail itself. I suppose you could merge the output of any
>> children which did finish... but it would be simpler not to bother.
>
> Right now, yes, this would make sense - I'll see about adding it.


Great.
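
To make the request concrete, here is a rough sketch of the fail-fast
behaviour I have in mind. It is not Galaxy's actual code: the ChildJob
class, its state names and check_split_task() are all hypothetical
stand-ins, and a real kill() would have to ask the cluster to delete the
job rather than just flip a flag.

```python
from dataclasses import dataclass


@dataclass
class ChildJob:
    """Hypothetical stand-in for one child job of a split task."""
    job_id: int
    state: str = "running"  # assumed states: "running", "ok", "error", "killed"

    def kill(self):
        # A real runner would ask the cluster to delete the job here.
        # Children that already finished or failed are left untouched.
        if self.state == "running":
            self.state = "killed"


def check_split_task(children):
    """Return the parent task state; kill the siblings once any child fails."""
    if any(child.state == "error" for child in children):
        for child in children:
            child.kill()
        return "error"
    if all(child.state == "ok" for child in children):
        return "ok"
    return "running"
```

The parent would call something like check_split_task() each time it polls
its children. Nothing is merged on failure, which matches the "simpler not
to bother" option above.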

> Ultimately we want to build in a mechanism for retrying child tasks that
> fail due to cluster errors, etc, so it isn't necessary to rerun the entire
> job.

That could be helpful - but also rather fiddly for detecting when it is
appropriate to retry a job or not. For the split-tasks, right now I'm finding
some child-jobs fail when the OS kills them due to running out of RAM -
in which case a neat idea would be to further sub-divide the jobs and
resubmit. This is probably over-engineering though... KISS principle.

Peter
