Hi,

There are currently three options to "--halt": ignore (0), stop new jobs (1), or kill everything (2).
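For concreteness, this is roughly how I understand the three current modes being invoked (the ./process script and data/*.in inputs below are made up purely for illustration):

    # Sketch of the three existing --halt modes; ./process and data/*.in
    # are hypothetical placeholders, not part of the proposal.
    parallel --halt 0 ./process {} ::: data/*.in   # 0: ignore failures, run every job
    parallel --halt 1 ./process {} ::: data/*.in   # 1: stop starting new jobs after a failure
    parallel --halt 2 ./process {} ::: data/*.in   # 2: kill all running jobs on a failure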
I propose an additional option: setting the number of job failures tolerated before anything is done. This would allow some tolerance of failure while still catching global problems.

Consider this example: running 1000 jobs of around 1 hour each, where a random handful will fail due to unexpected bad data or some other unforeseen bug, but the overwhelming majority will complete successfully. With --halt 0, all jobs will run and I can check for the failures afterwards. Great! However, say I forget to create the results directory, so every "good" job runs for its full time and then fails right at the end. If I wasn't monitoring, I have just wasted 1000 hours of processing time. With --halt > 0, the run will stop at or just after the first problem; I then have to check the logs, figure out and fix the issue if possible, rerun with the previous successes excluded, and so on.

What I would like is to set the number of tolerable failures to, say, the number of workers. A serious bug would then be caught after the first iteration, but the entire run would still complete while tolerating some measure of bad input data.

Does this make sense? Unfortunately it would require changing the current flags, either adding a new one or changing the current --halt options.

Thanks,
Ben