On Wed, May 7, 2014 at 10:33 AM, Sebastian Eiser <sebastian.ei...@gmail.com> wrote:
> Just a thought, which may be a simple solution, but suiting most people.
>
> The kernel is pretty good at killing misbehaving jobs. @Ole: can you capture
> SIGKILL from a job? Can you record memory usage shortly after SIGKILL?

If you look at --joblog, you will see that I capture the exit value and
the signal that the command died from.

If we are talking about swapping, the kernel will rarely kill a job. It
will only do that if we are running out of memory (both virtual and
physical), and that situation is way simpler to deal with:

  cat jobs | parallel -j100% --joblog my_joblog
  cat jobs | parallel -j50% --resume-failed --joblog my_joblog
  cat jobs | parallel -j25% --resume-failed --joblog my_joblog
  cat jobs | parallel -j12% --resume-failed --joblog my_joblog
  cat jobs | parallel -j6% --resume-failed --joblog my_joblog
  cat jobs | parallel -j3% --resume-failed --joblog my_joblog
  cat jobs | parallel -j1% --resume-failed --joblog my_joblog
  cat jobs | parallel -j1 --resume-failed --joblog my_joblog

This should scale up to 256-core machines.

So the above should work if you have disabled swap and enabled the
OOM-killer.

> Some people disable swap deliberately, so using swap as metric might not be
> general enough.

The above deals with that situation.

/Ole
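
The sequence of runs above could also be written as a loop. This is only a minimal sketch, assuming a POSIX shell. The function names slot_schedule and run_with_backoff are made up for illustration, and the early return relies on GNU parallel exiting with 0 only when no job failed:

```shell
# Print the -j values to try, from most to least parallel:
# the percentage is halved each round, then a final plain -j1.
slot_schedule() {
    pct=100
    while [ "$pct" -ge 1 ]; do
        printf '%s%%\n' "$pct"
        pct=$((pct / 2))
    done
    printf '1\n'
}

# Hypothetical helper: rerun the failed jobs with fewer and fewer job slots.
#   $1 = file with one command per line, $2 = joblog file
run_with_backoff() {
    jobs_file=$1
    joblog=$2
    resume=""    # the first run starts from scratch, as in the list above
    for j in $(slot_schedule); do
        # $resume is deliberately unquoted so that it vanishes when empty.
        if parallel -j"$j" $resume --joblog "$joblog" < "$jobs_file"; then
            return 0    # assumed: exit status 0 means no job failed
        fi
        resume="--resume-failed"
    done
    return 1
}
```

Called as "run_with_backoff jobs my_joblog", it would stop at the first -j level where nothing fails instead of always going through all eight runs.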