On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme <rusho...@caltech.edu> wrote:
> There are currently three options to "--halt": ignore (0), stop new jobs
> (1), or kill everything (2).
>
> I propose an additional option: to set the number of job failures to
> tolerate before doing anything. This would allow some tolerance of failure
> but would still catch global problems.
>
> Consider this example: running 1000 jobs of around 1 hr each, where a
> random handful will fail due to unexpected bad data or some other
> unforeseen bug, but the overwhelming majority will complete successfully.
>
> Setting --halt 0, all jobs will run, and I can check for the failures
> afterwards. Great! However, say I forget to create the results directory,
> so every "good" job runs for the full time and then fails right at the
> end... if I wasn't monitoring, I just wasted 1000 hrs of processing time.

This I do not understand. GNU Parallel 20140622 creates the dirs before
running, so your version is broken:

$ parallel --results /tmp/this/does/not/exist echo ::: 1
1
$ ls /tmp/this/does/not/exist/1/1/
stderr  stdout

> Setting --halt > 0, the job will stop at or just after the first problem.
> I have to check the logs, figure out and fix the problem if possible,
> rerun with previous successes excluded, etc.

Use --resume-failed for that.

> What I would like is to be able to set the number of tolerable failures
> to, say, the number of workers. Then a serious bug would be caught after
> the first iteration, but the entire job would still run and tolerate some
> measure of bad input data.

You need to give a reproducible example where you cannot just use --halt 0
and then later --resume-failed when you have fixed the bug/the input data.

> Does this make sense? Unfortunately it would require changing the current
> flags: either adding another or changing the current halt options.

One possible syntax is --halt 10%, to allow 10% of the jobs to fail.

/Ole
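P.S. A minimal sketch of the --halt 0 / --resume-failed workflow I mean.
The worker script ./process.sh and the joblog file name are hypothetical;
note that the command line must be identical in both runs so the joblog
entries match up:

# First run: record every job in the joblog, never halt on failures.
parallel --joblog joblog --halt 0 ./process.sh {} ::: input/*

# After fixing the bug or the bad input, rerun the failed jobs
# (and any jobs that never ran).
parallel --joblog joblog --resume-failed ./process.sh {} ::: input/*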