Hi,

There are currently three options to "--halt": ignore (0), stop new jobs (1), or kill everything (2).
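For concreteness, this is roughly how I understand the three current modes being invoked (the ./process script and data/*.in inputs below are made up purely for illustration):

    # Sketch of the three existing --halt modes; ./process and data/*.in
    # are hypothetical placeholders, not part of the proposal.
    parallel --halt 0 ./process {} ::: data/*.in   # 0: ignore failures, run every job
    parallel --halt 1 ./process {} ::: data/*.in   # 1: stop starting new jobs after a failure
    parallel --halt 2 ./process {} ::: data/*.in   # 2: kill all running jobs on a failure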
I propose an additional option: setting the number of job failures tolerated before anything is done. This would allow some tolerance of failure while still catching global problems.

Consider this example: running 1000 jobs of around 1 hour each, where a random handful will fail due to unexpected bad data or some other unforeseen bug, but the overwhelming majority will complete successfully. With --halt 0, all jobs will run and I can check for the failures afterwards. Great! However, say I forget to create the results directory, so every "good" job runs for its full time and then fails right at the end. If I wasn't monitoring, I have just wasted 1000 hours of processing time. With --halt > 0, the run will stop at or just after the first problem; I then have to check the logs, figure out and fix the issue if possible, rerun with the previous successes excluded, and so on.

What I would like is to set the number of tolerable failures to, say, the number of workers. A serious bug would then be caught after the first iteration, but the entire run would still complete while tolerating some measure of bad input data.

Does this make sense? Unfortunately it would require changing the current flags, either adding a new one or changing the current --halt options.

Thanks,
Ben