On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme <rusho...@caltech.edu> wrote:
> There are currently three options to "--halt": ignore (0), stop new jobs
> (1), or kill everything (2).
>
> I propose an additional option: to set the number of job failures to
> tolerate before doing anything. This would allow some tolerance of failure
> but would still catch global problems.
>
> Consider this example: running 1000 jobs of around 1 hr each, where a
> random handful will fail due to unexpected bad data or some other
> unforeseen bug, but the overwhelming majority will complete successfully.
>
> Setting --halt 0, all jobs will run, and I can check for the failures
> afterwards. Great! However, say I forget to create the results directory,
> so every "good" job runs for the full time and then fails right at the
> end... if I wasn't monitoring, I just wasted 1000 hrs of processing time.

This I do not understand. GNU Parallel 20140622 creates the dirs before
running, so your version is broken:

$ parallel --results /tmp/this/does/not/exist echo ::: 1
1
$ ls /tmp/this/does/not/exist/1/1/
stderr  stdout

> Setting --halt > 0, the job will stop at or just after the first problem.
> I have to check the logs, figure out and fix the problem if possible,
> rerun with previous successes excluded, etc.

Use --resume-failed for that.

> What I would like is to be able to set the number of tolerable failures
> to, say, the number of workers. Then a serious bug would be caught after
> the first iteration, but the entire job would still run and tolerate some
> measure of bad input data.

You need to give a reproducible example where you cannot just use --halt 0
and then later --resume-failed when you have fixed the bug/the input data.

> Does this make sense? Unfortunately it would require changing the current
> flags: either adding another or changing the current halt options.

One possible syntax is --halt 10%, to allow 10% of the jobs to fail.

/Ole
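P.S. A minimal sketch of the --halt 0 / --resume-failed workflow I mean.
The worker script ./process.sh and the joblog file name are hypothetical;
note that the command line must be identical in both runs so the joblog
entries match up:

# First run: record every job in the joblog, never halt on failures.
parallel --joblog joblog --halt 0 ./process.sh {} ::: input/*

# After fixing the bug or the bad input, rerun the failed jobs
# (and any jobs that never ran).
parallel --joblog joblog --resume-failed ./process.sh {} ::: input/*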