Re: Weird failure last night

Debra S Baddorf Thu, 14 May 2015 09:58:38 -0700

Well,  the other way around it   …. and I’m considering it ….. is to revert
to doing  “auth=bsd”  type backups.   Using all  UDP  connections,  rather than
TCP  ones.


Deb Baddorf
Fermilab

On May 14, 2015, at 11:52 AM, Debra S Baddorf <[email protected]> wrote:

>> Can 
>> this timeout be adjusted, or amanda somehow reconfigured to tolerate it 
>> better?  Currently:
> 
> Basically:  NO.  I’ve tried all the amanda params  I can find.  The timeout is
> external:  it’s built into the system’s TCP settings,  not into amanda’s.
> Handling it better requires recoding inside of amanda,  and nobody has had
> time (& knowledge)  to do it.
> I’ve thought about it ….   it’s on a deep back burner …..
> You have to send a connect request with “no-wait”  and then later send 
> another to actually decide if the node is
> there.
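
A minimal sketch of that "connect with no-wait, then check back later" idea, in Python rather than in amanda's own code. The function name is_reachable and its timeout parameter are made up for illustration; nothing here is taken from the amanda sources. It assumes IPv4 and a Unix-like system, and name resolution still happens up front and can raise.

```python
import errno
import select
import socket


def is_reachable(host, port, timeout=5.0):
    """Start a non-blocking connect, then wait at most `timeout` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # Step 1: send the connect request without waiting for the answer.
        err = sock.connect_ex((host, port))
        if err == 0:
            return True  # connected immediately (can happen on localhost)
        if err not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False  # refused or unreachable right away
        # Step 2: come back later. The socket becomes writable once the
        # attempt has finished, whether it succeeded or failed.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return False  # no answer within our own deadline: treat as down
        # SO_ERROR holds the final result of the connect attempt (0 = success).
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```

The point of doing it this way is that the caller picks the deadline, so the kernel's SYN retry schedule no longer decides how long a dead node holds things up.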
> 
> Some stuff I tried   (besides all the amanda params I could find):
> From  https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> These are all external  TCP  values,  not amanda values:
> 
> tcp_syn_retries - INTEGER       mine’s 5       BINGO - THIS WORKED
>       Number of times initial SYNs for an active TCP connection attempt
>       will be retransmitted. Should not be higher than 255. Default value
>       is 6, which corresponds to 63seconds till the last retransmission
>       with the current initial RTO of 1second. With this the final timeout
>       for an active TCP connection attempt will happen after 127seconds.
> 
> tcp_retries1 - INTEGER           mine’s 3
>       This value influences the time, after which TCP decides, that
>       something is wrong due to unacknowledged RTO retransmissions,
>       and reports this suspicion to the network layer.
>       See tcp_retries2 for more details.
> 
>       RFC 1122 recommends at least 3 retransmissions, which is the
>       default.
> 
> Oooo --  I’m seeing 3 minutes 9 seconds (189 seconds) .....   which is 3
> (tcp_retries1)  times the  63secs mentioned above in  tcp_syn_retries
> (except that mine isn’t at 6, but 5).  Hmm.
> 
> Per my side column comment above,  I actually changed this value at the
> system level,  and saw a change in the length of time that amanda had to
> wait to know that a node was down.
>         (as noticed by “amcheck” )
> 
> SO   I tried some values of  tcp_syn_retries …. for my whole server node
> (which affects ALL the tcp connections,  mind you ….. I could try it, since
> my node only does backups):
> 
> This param defaults to 5.   It takes 3 min 9 seconds to learn that a node is 
> not on the network.
> 
> At 1,    it only took 9 seconds!
> At 2,    it only took 21 seconds.
> At 3,    it took 45 seconds.     I think I’ll try this value (3)   for a bit 
> and see if missing nodes
> continue to bother other (mostly nonUDP)  nodes.
> 
> I did              /sbin/sysctl   -w   net.ipv4.tcp_syn_retries=3
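
The timings above line up with simple exponential backoff. This is a small sketch, assuming the initial RTO of 1 second from the quoted kernel text, doubling on each retry. The factor of 3 (ATTEMPTS) is only inferred from the measurements; where it comes from is not established in this thread. The function name is made up for illustration.

```python
# Sketch: how long a TCP connect attempt takes to give up, assuming the
# initial retransmission timeout (RTO) is 1 second and doubles per retry.


def syn_timeout_seconds(syn_retries, initial_rto=1):
    """Seconds until connect() fails: the wait after the first SYN plus
    the wait after each of `syn_retries` retransmissions."""
    return sum(initial_rto * 2 ** i for i in range(syn_retries + 1))


# Assumption: three connect attempts in a row. This is inferred from the
# measured times only; it is not a documented amanda or kernel setting.
ATTEMPTS = 3

# Times reported in this thread, keyed by tcp_syn_retries value.
observed = {1: 9, 2: 21, 3: 45, 5: 3 * 60 + 9}

for retries, seen in sorted(observed.items()):
    predicted = ATTEMPTS * syn_timeout_seconds(retries)
    print(f"tcp_syn_retries={retries}: predicted {predicted}s, observed {seen}s")
```

For tcp_syn_retries=6 the same function gives 127 seconds, which agrees with the final timeout quoted from the kernel documentation.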
> 
> =================
> 
> HOWEVER -- I left two nodes off,  but still tried to do backups of them,
> with the above param set to 3,
> AND IT DIDN’T HELP AT ALL.   THE WHOLE BLASTED BACKUP SET
> FAILED!!!    i.e.  the other 36  nodes all failed,  rather than just a
> few of them.
>       as in -- the parameter had more negative effects than it had positive
> effects
> 
> 
> Sooooo   I believe it requires a code solution,  inside the amanda code.
> 
> Currently,  I run a cron job to check that all my nodes are up, before I 
> leave each evening.  
> Sometimes they still fail later,  and I have trouble.   I have no current 
> solution.  :(
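
For what it's worth, a pre-flight check of that kind could look roughly like the sketch below. It is not the actual cron job. It assumes the clients listen on TCP port 10080 (the usual amandad port for bsdtcp; other auth methods may use a different one), and the names node_is_up and find_down_nodes are made up. Opening and closing a connection like this may leave an entry in the client's logs.

```python
import socket
import sys
from concurrent.futures import ThreadPoolExecutor

AMANDA_PORT = 10080  # assumption: amandad port used by bsdtcp clients
TIMEOUT = 5.0        # seconds to wait per node before calling it down


def node_is_up(host, port=AMANDA_PORT, timeout=TIMEOUT):
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_down_nodes(hosts, port=AMANDA_PORT, timeout=TIMEOUT):
    """Probe all hosts in parallel so one dead node does not delay the rest."""
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=min(32, len(hosts))) as pool:
        results = list(pool.map(lambda h: node_is_up(h, port, timeout), hosts))
    return [h for h, up in zip(hosts, results) if not up]


# Runs only when hostnames are given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    down = find_down_nodes(sys.argv[1:])
    for host in down:
        print(f"DOWN: {host}")
    sys.exit(1 if down else 0)
```

Run from cron with the client hostnames as arguments, it exits non-zero and lists the unreachable nodes. As noted above, it cannot catch a node that goes down after the check.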
> 
> Deb Baddorf
> Fermilab
> 
> On May 14, 2015, at 4:38 AM, Gene Heskett <[email protected]> wrote:
> 
>> On Wednesday 13 May 2015 13:10:47 Debra S Baddorf wrote:
>>> Several of us have problems where TCP based connections to nodes that
>>> are down (depending on the “auth” method in use) take long enough to
>>> time out  that they cause failures on other nodes which are NOT down.
>>>      I have older nodes which are still using udp connections, which
>>> do not ever have this problem. Does this sound like it may be involved
>>> in your case?  “auth=bsdtcp”  is the default  with all the modern
>>> versions of amanda.     I have the problem if the node down is  
>>> “auth=bsdtcp”  or “auth=krb5”  but not on the  “auth=bsd” nodes.
>>> 
>>> Deb Baddorf
>>> Fermilab
>> 
>> And I am using bsdtcp, so this fits the observed behaviour to a T.  Can 
>> this timeout be adjusted, or amanda somehow reconfigured to tolerate it 
>> better?  Currently:
>> 
>> ctimeout = 4 (was 7, but both machines respond much quicker than that)
>> 
>> dtimeout = 1200 (this is 20 minutes, but with decent sata drives could 
>> shrink even more)
>> 
>> etimeout = 300 (I don't recall the mode ATM, may be client estimate, it's 
>> quick)
>> 
>> I don't see anything else in the manpage that looks applicable.
>> 
>> Thanks Deb.  With that machine alive last night, it all Just Worked(TM).
>> 
>>> On May 13, 2015, at 8:49 AM, Gene Heskett <[email protected]> wrote:
>>>> Greetings all;
>>>> 
>>>> Running 3.3.7p1 on assorted client installs.
>>>> 
>>>> No clue why, but it appears that one of the machines (alias "shop")
>>>> in the shop crashed last night, sometime after I closed up for the
>>>> evening. I've also lost my ssh -Y link to it, and no ping response.
>>>> 
>>>> When amdump ran, it was of course un-available.
>>>> 
>>>> But that seems to have killed the backups for this machine too, as
>>>> it was only able to back up the lathe.
>>>> 
>>>> amcheck, when I ran it just now, only reported;
>>>> "WARNING: shop: selfcheck request failed: No route to host"
>>>> One problem.
>>>> 
>>>> But amdump totally failed this machine, "coyote", too. What sort of
>>>> gremlins might do that?
>>>> 
>>>> Cheers, Gene Heskett
>>>> --
>>>> "There are four boxes to be used in defense of liberty:
>>>> soap, ballot, jury, and ammo. Please use in that order."
>>>> -Ed Howdershelt (Author)
>>>> Genes Web page <http://geneslinuxbox.net:6309/gene>
>> 
>> Cheers, Gene Heskett
>> -- 
>> "There are four boxes to be used in defense of liberty:
>> soap, ballot, jury, and ammo. Please use in that order."
>> -Ed Howdershelt (Author)
>> Genes Web page <http://geneslinuxbox.net:6309/gene>
>> 
>
