On Sat, Aug 16, 2014 at 2:59 AM, Douglas A. Augusto <daaugu...@gmail.com> wrote:

> On 14/08/2014 at 09:30, Ole Tange <o...@tange.dk> wrote:
>
>> Removing is, however, a completely different ballgame: What do you do
>> about jobs currently running on the servers? Also there is no
>> infrastructure to say: Don't start new jobs on this server and remove
>> it when the last job completes. The easiest is probably to add a 'dont
>> start new jobs' flag to the server-object, and leave the data
>> structure in place. It will, however, cost a few cycles to skip the
>> server every time a new job is started.
>>
>> --filter-hosts does the "removal" by not adding the host in the first place.
>
> In order to make GNU Parallel more resilient, particularly when running
> jobs on remote servers over unreliable internet connections, I think it
> should be able to detect when a server is "down" and when it is "back"
> again. This would be like a dynamic "--filter-hosts".
I have been thinking along the same lines, but have been unable to find
an easy way of doing that in practice. Here are some of the problems:

There is only one valid test to see if a machine is up, and that is to
ssh to it and run a command (such as /bin/true). We cannot assume that
the hostname given is known to DNS: I have several host aliases that are
only defined in my .ssh/config and that are behind a firewall you first
have to log into. SSH works fine, but ping would fail miserably.

GNU Parallel is (as crazy as it sounds) mostly serial. So if we need to
run a test before starting a new job, all other jobs will be delayed.

Should we test whether a server is down before spawning a new job? If
the jobs are long, the added time for the test might not be too bad. But
if the jobs are short, the test will add considerable time to running
the job. And if the server is dead, we will have to wait for a timeout -
delaying jobs even further.

We can assume that the server is up if a job completes without error. If
there is an error, we cannot tell whether the job returned an error or
ssh failed (i.e. the server is down). But failing jobs are clearly an
indication that the server could be down. So the test could be done
here: check whether the server is down when a job fails. That adds a
short delay if the server is up, and a longer delay if the server is
down.

Now let us assume server1 is down and removed. How do we add it back?
When should we retry to see if server1 is up? A failed try is expensive,
as it delays everything (a simple test of ssh indicates it takes 120
seconds to time out). A way to mitigate this could be to time out
earlier: We know how long it used to take to log in to the host, so we
could use a timeout that is 10 times the original time. Another way
would be to add server1 back after some timeout and let it be kicked
again if a job fails on it again - doubling the timeout before it can be
considered again.

All in all doable, but it does not seem trivially simple.

> With respect to what to do with jobs currently running on the servers,
> I think GNU Parallel should simply wait until they complete. If the
> user really wants to kill them, he or she could do this manually
> (perhaps even using a second instance of GNU Parallel to issue a
> "broadcast kill"). Alternatively this decision could be left to the
> user via a user-defined parameter.

Having given this a bit more thought, it should be possible to set the
number of jobslots on a host to 0. That would have the effect that no
new jobs would be spawned.

/Ole
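
P.S. To make the timeout and backoff ideas above concrete, here is a
rough sketch in shell. This is not code from GNU Parallel - the variable
names and starting values are made up - it merely writes down the "time
out at 10 times the normal login time, double the backoff on every
failure" idea:

  # Sketch of the down-detection discussed above. Assumed values:
  host=server1            # hypothetical hostname
  expected_login_time=2   # seconds a login normally takes
  backoff=${backoff:-60}  # seconds before server1 is considered again

  # Time out after 10 times the normal login time instead of waiting
  # ~120 seconds for ssh's own timeout.
  if timeout $(( expected_login_time * 10 )) ssh "$host" /bin/true; then
      echo "$host is up"
      backoff=60                 # reset the backoff on success
  else
      echo "$host seems down; next retry in $backoff seconds"
      backoff=$(( backoff * 2 )) # double the wait before retrying
  fi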
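
P.P.S. The number of jobslots on a remote host can already be given as
part of the sshlogin, so '0/server1' would be a natural way for the user
to express "do not start new jobs here". For example (server1 is a
hypothetical host):

  # Run at most 2 jobs at a time on server1.
  parallel -S 2/server1 echo ::: a b c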