On 14/08/2014 at 09:30, Ole Tange <o...@tange.dk> wrote:

> --sshloginfile already takes a file, so it will be natural to re-read
> that. Probably using this method:
>
> When a job finishes, and it is more than 1 second since we checked last:
>   Check if the file has changed modification time. If yes: re-read it.
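For concreteness, the quoted method could be sketched roughly as below. This is a minimal Python sketch of the idea only; the class and method names are mine, not anything in GNU Parallel (which is written in Perl), and the hosts-file format is simplified to one host per line:

```python
import os
import time

class SshLoginFile:
    """Hypothetical sketch of a throttled, mtime-based re-read of
    an --sshloginfile-style hosts file."""

    def __init__(self, path):
        self.path = path
        self.last_check = 0.0   # time of the last mtime check
        self.last_mtime = None  # mtime seen at the last re-read
        self.hosts = []
        self._reread()

    def _reread(self):
        self.last_mtime = os.path.getmtime(self.path)
        with open(self.path) as f:
            # one host per line; skip blank lines and comments
            self.hosts = [line.strip() for line in f
                          if line.strip() and not line.startswith("#")]

    def on_job_finished(self):
        """Call whenever a job completes; returns True if the file
        was re-read."""
        now = time.time()
        if now - self.last_check < 1.0:  # throttle: at most once per second
            return False
        self.last_check = now
        if os.path.getmtime(self.path) != self.last_mtime:
            self._reread()
            return True
        return False
```

The point of the 1-second throttle is that checking the mtime on every job completion would be wasteful when many short jobs finish per second.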
Dear Ole,

Thanks for your reply. That seems to be a nice solution.

> Removing is, however, a completely different ballgame: What do you do
> about jobs currently running on the servers? Also there is no
> infrastructure to say: Don't start new jobs on this server and remove
> it when the last job completes. The easiest is probably to add a 'dont
> start new jobs' flag to the server-object, and leave the data
> structure in place. It will, however, cost a few cycles to skip the
> server every time a new job is started.
>
> --filter-hosts does the "removal" by not adding the host in the first
> place.

To make GNU Parallel more resilient, particularly when running jobs on remote servers over unreliable internet connections, it should be able to detect when a server is "down" and when it is "back" again -- in effect, a dynamic "--filter-hosts". The likelihood that at least one server becomes inaccessible (temporarily or permanently) grows quickly with the number of servers and the processing time, so a feature like this would make GNU Parallel considerably more robust and scalable. Currently, when a server goes off-line, GNU Parallel does not notice and keeps trying to schedule jobs on it, all of which fail.

As for jobs currently running on a removed server, I think GNU Parallel should simply wait until they complete. If the user really wants to kill them, he or she could do so manually (perhaps even using a second instance of GNU Parallel to issue a "broadcast kill"). Alternatively, this decision could be left to the user via a command-line option.

Best regards,

-- 
Douglas A. Augusto
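P.S. To illustrate what I mean by detecting "down"/"back" hosts: a cheap, non-interactive probe could mark a host as unavailable when it fails and re-enable the host once it succeeds again. The sketch below is mine (Python, with an `ssh ... true` probe and timeouts I chose arbitrarily), not anything GNU Parallel does today:

```python
import subprocess

def host_is_up(host, timeout=5):
    """Probe a remote host with a cheap non-interactive ssh command.

    Returns True if `ssh host true` exits 0 within the timeout.
    BatchMode=yes prevents ssh from hanging on a password prompt.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={timeout}", host, "true"],
            capture_output=True, timeout=timeout + 5)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # Sketch simplification: a hung probe (or a missing ssh
        # binary) is treated the same as an unreachable host.
        return False

def filter_hosts(hosts):
    """Dynamic "--filter-hosts": keep only currently reachable hosts.

    Unreachable hosts are merely skipped this round; probing them
    again later lets them come "back" automatically.
    """
    return [h for h in hosts if host_is_up(h)]
```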