On 28/08/2014 at 00:34, Ole Tange <o...@tange.dk> wrote:
> No idea, but it is very likely that the code has bugs: It is the
> youngest code and there is no testing of that part of the code. If you
> can show something reproducible then let us fix it. Your description
> is unfortunately not enough for me to see if the bug is in GNU
> Parallel and in that case where the bug is.

Hi,

I've figured out what is going wrong: every time the ssh login file is
reloaded, it seems that a number of jobs are immediately launched on *all*
servers, regardless of their load (number of slots currently in use).

Suppose we have the following entries in the ssh login file:

    1/server1.net
    5/server5.net

and that there are 1 and 5 jobs currently running on server1.net and
server5.net, respectively. If this file is reread for whatever reason,
GNU Parallel will launch 1 more job on server1.net (for a total of 2 jobs)
and 5 more jobs on server5.net, for a total of 10 jobs there. After that,
as long as the ssh login file doesn't change again, GNU Parallel behaves
as expected and only starts new jobs when there is no job running on
server1.net and fewer than 5 jobs running on server5.net.

Unfortunately, for any large set of reasonably long-running jobs, when the
ssh login file changes frequently (which is common on unreliable
machines/networks), GNU Parallel ends up launching a zillion jobs on each
server, effectively rendering them inoperative.

-- 
Douglas A. Augusto
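
The arithmetic above can be reproduced with a toy model. This is only a
sketch of the slot accounting as described in this message, not GNU
Parallel's actual code: the names Host, fill and reload_buggy are
hypothetical, and it assumes the jobs run long enough that none finish
between reloads.

```python
from dataclasses import dataclass


@dataclass
class Host:
    slots: int        # job limit from the ssh login file, e.g. 5 in "5/server5.net"
    running: int = 0  # jobs currently running on this host

    def free(self) -> int:
        # Slots still available; never negative, even when oversubscribed.
        return max(self.slots - self.running, 0)


def fill(hosts):
    """Expected behaviour: start new jobs only in free slots."""
    for host in hosts.values():
        host.running += host.free()


def reload_buggy(hosts):
    """Observed behaviour (assumed model): after a reload of the ssh login
    file, every slot is treated as free, whatever is already running."""
    for host in hosts.values():
        host.running += host.slots


hosts = {"server1.net": Host(slots=1), "server5.net": Host(slots=5)}

fill(hosts)                      # steady state: 1 and 5 jobs running
steady = {name: h.running for name, h in hosts.items()}

reload_buggy(hosts)              # the ssh login file is reread once
after_reload = {name: h.running for name, h in hosts.items()}

fill(hosts)                      # no further change: nothing new is started
after_fill = {name: h.running for name, h in hosts.items()}

print(steady)        # {'server1.net': 1, 'server5.net': 5}
print(after_reload)  # {'server1.net': 2, 'server5.net': 10}
print(after_fill)    # {'server1.net': 2, 'server5.net': 10}
```

Under this model each further reload adds another full set of slots, so
after k reloads a host is running slots * (k + 1) jobs, which matches the
runaway behaviour seen when the file changes frequently.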