Re: fault tolerance, retry task on different node, recovery orientation?

Ole Tange Thu, 29 May 2014 15:18:06 -0700

As mentioned in the man page: Computers will only be reused if the
number of retries > number of computers (or more correctly:
sshlogins).


The order in which the computers are tried is the order in which the
values come out of a Perl hash using 'values', so it is effectively
arbitrary. I am still puzzled why you believe this order is important.
I would think it is much more important to know that a computer on
which the job has failed will not be chosen again unless the number of
retries > number of sshlogins.
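
To make the rule concrete, here is a rough sketch in Python. It is not
GNU parallel's code (which is Perl); the names pick_host and
run_with_retries are made up for illustration, the list order stands in
for whatever order the hash yields, and it assumes that "retries"
counts the total number of attempts.

```python
def pick_host(hosts, failed_on):
    """Pick a host for the next attempt of one job.

    hosts:     sshlogins, in whatever order the program iterates them
               (assumption: this list models the hash order)
    failed_on: set of hosts on which this job has already failed
    """
    for host in hosts:
        if host not in failed_on:
            return host
    # The job has failed on every host: only now may a host be reused.
    failed_on.clear()
    return hosts[0]


def run_with_retries(job, hosts, max_tries, run):
    """Try `job` up to `max_tries` times; `run(job, host)` returns True on success.

    Returns (succeeded, list_of_hosts_tried).
    """
    failed_on = set()
    attempts = []
    for _ in range(max_tries):
        host = pick_host(hosts, failed_on)
        attempts.append(host)
        if run(job, host):
            return True, attempts
        failed_on.add(host)
    return False, attempts
```

With three hosts and a job that always fails, the first three attempts
in this sketch land on three different hosts; a host is only seen a
second time from the fourth attempt on.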


/Ole

On Thu, May 29, 2014 at 9:27 PM, Mitchell Wyle <[email protected]> wrote:
> Hi Ole,
>
> Thanks for the quick reply.  I meant, if I have 10 SSHLOGIN computers how
> does parallel choose on which one it will dispatch the next job and to which
> one it will dispatch a failed job that it is retrying.  The selection method
> it uses for selecting which computer when it does what the man page says:
> "retry it on another computer."  round-robin is better than random
> (zookeeper) and better than "least loaded."
>
> Thanks again.
>
>
>
>
> On Thu, May 29, 2014 at 12:20 PM, Ole Tange <[email protected]> wrote:
>>
>> On Thu, May 29, 2014 at 8:54 PM, Mitchell Wyle <[email protected]> wrote:
>> > Cool!  I shall try simple --retries and verify it works.    Does it
>> > "round
>> > robin" the tries?  Thanks!
>>
>> No. It does what it says in the man page:
>>
>>        --retries n
>>                 If a job fails, retry it on another computer. Do
>>                 this n times. If there are fewer than n computers
>>                 in --sshlogin GNU parallel will re-use the
>>                 computers. This is useful if some jobs fail for no
>>                 apparent reason (such as network failure).
>>
>> Why do you think it would do something other than what it says in the
>> man page?
>>
>>
>> /Ole
>
>
