> Can this timeout be adjusted, or amanda somehow reconfigured to tolerate
> it better?

Basically: NO. I've tried all the amanda parameters I can find. The
timeout is external to amanda: it's built into the TCP system and its
values. Changing it would require recoding inside amanda, and nobody has
had the time (and knowledge) to do it. I've thought about it ... it's on
a deep back burner. You would have to send a connect request with
"no-wait", and then send another call later to actually decide whether
the node is there.
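For the curious, that "no-wait" idea looks roughly like this (a minimal
Python sketch, NOT amanda code; the host name and the 10-second budget
are invented for illustration, and 10080 is amanda's registered service
port):

    # Sketch of the "no-wait" idea: start a non-blocking TCP connect,
    # go do other work, then check back with a deadline of OUR choosing
    # instead of waiting out the kernel's SYN retransmission schedule.
    import errno
    import select
    import socket

    def start_connect(host, port):
        """Fire off the connect request without waiting for the handshake."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        rc = s.connect_ex((host, port))   # returns immediately
        if rc not in (0, errno.EINPROGRESS):
            s.close()
            raise OSError(rc, "connect failed immediately")
        return s

    def check_connect(s, wait_seconds):
        """Later: decide whether the node is actually there."""
        _, writable, _ = select.select([], [s], [], wait_seconds)
        if not writable:
            return False   # no SYN-ACK yet: treat the node as down/slow
        err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        return err == 0    # 0 means the handshake completed

    s = start_connect("shop.example.com", 10080)
    # ... a dumper could be starting connects to other nodes here ...
    print("up" if check_connect(s, 10.0) else "down or unreachable")

The point is that the waiting becomes the caller's decision; a dead node
would cost ten seconds of one dumper's time instead of stalling
everything.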
Some stuff I tried (besides all the amanda parameters I could find),
from https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
These are all external TCP values, not amanda values:

tcp_syn_retries - INTEGER   (mine is 5)      <== BINGO: THIS WORKED
    Number of times initial SYNs for an active TCP connection attempt
    will be retransmitted. Should not be higher than 255. Default value
    is 6, which corresponds to 63 seconds till the last retransmission
    with the current initial RTO of 1 second. With this the final
    timeout for an active TCP connection attempt will happen after
    127 seconds.

tcp_retries1 - INTEGER   (mine is 3)
    This value influences the time after which TCP decides that
    something is wrong due to unacknowledged RTO retransmissions, and
    reports this suspicion to the network layer. See tcp_retries2 for
    more details. RFC 1122 recommends at least 3 retransmissions, which
    is the default.

Oooo -- I'm seeing 3 minutes 9 seconds ... which is about 3
(tcp_retries1) times the 63 seconds mentioned under tcp_syn_retries
(except that mine is set to 5, not 6). Hmm. Per my side-column comment
above ("BINGO"), I actually changed this value at the system level and
saw a change in the length of time that amanda had to wait to know that
a node was down (as noticed by "amcheck").

SO I tried some values of tcp_syn_retries for my whole server node
(which affects ALL of its TCP, mind you ... I could risk that, since my
node only does backups). The parameter defaults to 5, at which it takes
3 minutes 9 seconds to learn that a node is not on the network:

    at 1, it only took  9 seconds
    at 2, it only took 21 seconds
    at 3, it took      45 seconds

I think I'll try this value (3) for a bit and see if missing nodes
continue to bother the other (mostly non-UDP) nodes. I did:

    /sbin/sysctl -w net.ipv4.tcp_syn_retries=3
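If anybody wants the arithmetic: per the kernel doc above, the initial
RTO is 1 second and it doubles on every SYN retransmission, so a single
connect attempt gives up after 2^(tcp_syn_retries + 1) - 1 seconds. The
extra factor of 3 is my guess (tcp_retries1?), an assumption on my part,
but it reproduces every number I measured:

    # Reproduce the measured waits: 1s initial RTO, doubled after each
    # SYN retransmission. The "times 3" multiplier is an assumption
    # (tcp_retries1 = 3?), not anything amanda or the kernel doc states.
    def syn_timeout(syn_retries, initial_rto=1):
        """Seconds until one connect() attempt gives up."""
        # SYN at t=0, retries at t=1, 3, 7, ... => 2**(n+1) - 1 total.
        return initial_rto * (2 ** (syn_retries + 1) - 1)

    for r in (1, 2, 3, 5, 6):
        t = syn_timeout(r)
        print(f"tcp_syn_retries={r}: {t:3d}s per attempt, {3 * t}s with x3")
    # 1:   3s per attempt,   9s  <- measured  9 seconds
    # 2:   7s per attempt,  21s  <- measured 21 seconds
    # 3:  15s per attempt,  45s  <- measured 45 seconds
    # 5:  63s per attempt, 189s  <- measured 3 min 9 s
    # 6: 127s per attempt        (the documented kernel default)

(If you want the sysctl change to survive a reboot, the usual place is
/etc/sysctl.conf, though read on before you bother.)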
=================

HOWEVER -- I left two nodes off, while still trying to do backups of
them, with the above parameter set to 3, AND IT DIDN'T HELP AT ALL. THE
WHOLE BLASTED BACKUP SET FAILED!!! I.e. the other 36 nodes all failed,
rather than just a few of them. As in: the parameter had more negative
effects than it had positive effects.

Sooooo ... I believe it requires a code solution, inside the amanda
code.

Currently, I run a cron job to check that all my nodes are up before I
leave each evening (a rough sketch of that kind of check is at the
bottom of this message). Sometimes they still fail later, and I have
trouble. I have no current solution. :(

Deb Baddorf
Fermilab

On May 14, 2015, at 4:38 AM, Gene Heskett <[email protected]> wrote:

> On Wednesday 13 May 2015 13:10:47 Debra S Baddorf wrote:
>> Several of us have problems where TCP-based connections to nodes that
>> are down (depending on the "auth" method in use) take long enough to
>> time out that they cause failures on other nodes which are NOT down.
>> I have older nodes which are still using UDP connections, which do
>> not ever have this problem. Does this sound like it may be involved
>> in your case? "auth=bsdtcp" is the default with all the modern
>> versions of amanda. I have the problem if the node that is down is
>> "auth=bsdtcp" or "auth=krb5", but not on the "auth=bsd" nodes.
>>
>> Deb Baddorf
>> Fermilab
>
> And I am using bsdtcp, so this fits the observed behaviour to a T. Can
> this timeout be adjusted, or amanda somehow reconfigured to tolerate
> it better? Currently:
>
> ctimeout = 4 (was 7, but both machines respond much quicker than that)
>
> dtimeout = 1200 (this is 40 minutes, but with decent sata drives could
> shrink even more)
>
> etimeout = 300 (I don't recall the mode ATM, may be client estimate,
> it's quick)
>
> I don't see anything else in the manpage that looks applicable.
>
> Thanks Deb. With that machine alive last night, it all Just Worked(TM).
>
>> On May 13, 2015, at 8:49 AM, Gene Heskett <[email protected]> wrote:
>>> Greetings all;
>>>
>>> Running 3.3.7p1 on assorted client installs.
>>>
>>> No clue why, but it appears that one of the machines (alias "shop")
>>> in the shop crashed last night, sometime after I closed up for the
>>> evening. I've also lost my ssh -Y link to it, and no ping response.
>>>
>>> When amdump ran, it was of course unavailable.
>>>
>>> But that seems to have killed the backups for this machine too, as
>>> it was only able to back up the lathe.
>>>
>>> amcheck, when I ran it just now, only reported:
>>> "WARNING: shop: selfcheck request failed: No route to host"
>>> One problem.
>>>
>>> But amdump totally failed this machine, "coyote", too. What sort of
>>> gremlins might do that?
>>>
>>> Cheers, Gene Heskett
>>> --
>>> "There are four boxes to be used in defense of liberty:
>>> soap, ballot, jury, and ammo. Please use in that order."
>>> -Ed Howdershelt (Author)
>>> Genes Web page <http://geneslinuxbox.net:6309/gene>
>
> Cheers, Gene Heskett
> --
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> Genes Web page <http://geneslinuxbox.net:6309/gene>
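P.S. The cron-job check mentioned above amounts to something like the
sketch below. It is hypothetical, not my actual script: fill in your own
disklist hosts, and note that 10080 is the amanda client service port.
The short per-host timeout is the whole trick; it means one dead node is
reported in seconds rather than minutes.

    # Pre-flight check: try each client with a short TCP timeout so a
    # dead node is found quickly, before amdump trips over it tonight.
    import socket

    NODES = ["shop", "lathe", "coyote"]   # hypothetical host list
    PORT = 10080                          # amanda client service port

    down = []
    for host in NODES:
        try:
            # A plain blocking connect, but with OUR 5-second timeout
            # instead of the kernel's SYN retransmission schedule.
            with socket.create_connection((host, PORT), timeout=5.0):
                pass
        except OSError:
            down.append(host)

    if down:
        print("DOWN before tonight's amdump:", ", ".join(down))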
