On Sat, Aug 16, 2014 at 2:59 AM, Douglas A. Augusto <daaugu...@gmail.com> wrote:

> On 14/08/2014 at 09:30, Ole Tange <o...@tange.dk> wrote:
>
>> Removing is, however, a completely different ballgame: What do you do
>> about jobs currently running on the servers? Also there is no
>> infrastructure to say: Don't start new jobs on this server and remove
>> it when the last job completes. The easiest is probably to add a 'dont
>> start new jobs' flag to the server-object, and leave the data
>> structure in place. It will, however, cost a few cycles to skip the
>> server every time a new job is started.
>>
>> --filter-hosts does the "removal" by not adding the host in the first place.
>
> In order to make GNU Parallel more resilient, particularly when running
> jobs on remote servers over unreliable internet connections, I think it
> should be able to detect when a server is "down" and when it is "back"
> again. This would be like a dynamic "--filter-hosts".
I have been thinking along the same lines, but have been unable to find
an easy way of doing that in practice. Here are some of the problems:

There is only one valid test to see if a machine is up, and that is to
ssh to it and run a command (such as /bin/true). We cannot assume that
the hostname given is known to DNS: I have several host aliases that are
only defined in my .ssh/config and that are behind a firewall you first
have to log into. SSH works fine, but ping would fail miserably.

GNU Parallel is (as crazy as it sounds) mostly serial. So if we need to
run a test before starting a new job, all other jobs will be delayed.

Should we test whether a server is down before spawning a new job? If
the jobs are long, the added time for the test might not be too bad. But
if the jobs are short, the test will add considerable time to running
the job. And if the server is dead, we will have to wait for a timeout -
delaying jobs even further.

We can assume that the server is up if a job completes without error. If
there is an error, we cannot tell whether the job returned an error or
ssh failed (i.e. the server is down). But failing jobs are clearly an
indication that the server could be down. So the test could be done
here: check whether the server is down when a job fails. That adds a
short delay if the server is up, and a longer delay if the server is
down.

Now let us assume server1 is down and removed. How do we add it back?
When should we retry to see if server1 is up? A failed try is expensive,
as it delays everything (a simple test of ssh indicates it takes 120
seconds to time out). A way to mitigate this could be to time out
earlier: We know how long it used to take to log in to the host, so we
could use a timeout that is 10 times the original time. Another way
would be to add server1 back after some timeout and let it be kicked
again if a job fails on it again - doubling the timeout before it can be
considered again.

All in all doable, but it does not seem trivially simple.

> With respect to what to do with jobs currently running on the servers,
> I think GNU Parallel should simply wait until they complete. If the
> user really wants to kill them, he or she could do this manually
> (perhaps even using a second instance of GNU Parallel to issue a
> "broadcast kill"). Alternatively this decision could be left to the
> user via a user-defined parameter.

Having given this a bit more thought, it should be possible to set the
number of jobslots on a host to 0. That would have the effect that no
new jobs would be spawned.

/Ole
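
P.S. To make the timeout and backoff ideas above concrete, here is a
rough sketch in shell. This is not code from GNU Parallel - the variable
names and starting values are made up - it merely writes down the "time
out at 10 times the normal login time, double the backoff on every
failure" idea:

  # Sketch of the down-detection discussed above. Assumed values:
  host=server1            # hypothetical hostname
  expected_login_time=2   # seconds a login normally takes
  backoff=${backoff:-60}  # seconds before server1 is considered again

  # Time out after 10 times the normal login time instead of waiting
  # ~120 seconds for ssh's own timeout.
  if timeout $(( expected_login_time * 10 )) ssh "$host" /bin/true; then
      echo "$host is up"
      backoff=60                 # reset the backoff on success
  else
      echo "$host seems down; next retry in $backoff seconds"
      backoff=$(( backoff * 2 )) # double the wait before retrying
  fi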
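
P.P.S. The number of jobslots on a remote host can already be given as
part of the sshlogin, so '0/server1' would be a natural way for the user
to express "do not start new jobs here". For example (server1 is a
hypothetical host):

  # Run at most 2 jobs at a time on server1.
  parallel -S 2/server1 echo ::: a b c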