Well, the other way around it ... and I’m considering it ... is to revert to doing “auth=bsd” type backups, using all UDP connections rather than TCP ones.
Deb Baddorf
Fermilab

On May 14, 2015, at 11:52 AM, Debra S Baddorf <[email protected]> wrote:

>> Can
>> this timeout be adjusted, or amanda somehow reconfigured to tolerate it
>> better? Currently:
>
> Basically: NO. I’ve tried all the amanda params I can find. It’s
> external, and built into the TCP system and values. It requires recoding
> inside of amanda, and nobody has had time (& knowledge) to do it.
> I’ve thought about it ... it’s on a deep back burner ...
> You have to send a connect request with “no-wait” and then later send
> another to actually decide if the node is there.
>
> Some stuff I tried (besides all the amanda params I could find):
> From https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> These are all external TCP values, not amanda values:
>
> tcp_syn_retries - INTEGER                                        mine’s 5
>     Number of times initial SYNs for an active TCP connection attempt
>     will be retransmitted. Should not be higher than 255. Default value     BINGO-
>     is 6, which corresponds to 63seconds till the last retransmission       THIS
>     with the current initial RTO of 1second. With this the final timeout    WORKED
>     for an active TCP connection attempt will happen after 127seconds.
>
> tcp_retries1 - INTEGER                                           mine’s 3
>     This value influences the time, after which TCP decides, that
>     something is wrong due to unacknowledged RTO retransmissions,
>     and reports this suspicion to the network layer.
>     See tcp_retries2 for more details.
>
>     RFC 1122 recommends at least 3 retransmissions, which is the
>     default.
>
> Oooo -- I’m seeing 3 minutes 9 seconds ... which is about 3
> (tcp_retries) times the above 63secs mentioned in tcp_syn_retries
> (except that mine isn’t at 6, but 5). Hmm.
>
> Per my side column comment above, I actually changed my system level of
> this value, and saw a change in the length of time that amanda had to
> wait to know that a node was down (as noticed by “amcheck”).
>
> SO I tried some values of tcp_syn_retries ... for my whole server node
> (which affects ALL the tcp, mind you ... I could try it, since my node
> only does backups):
>
> This param defaults to 5. It takes 3 min 9 seconds to learn that a node
> is not on the network.
>
> At 1, it only took 9 seconds!
> At 2, it only took 21 seconds.
> At 3, it took 45 seconds. I think I’ll try this value (3) for a bit
> and see if missing nodes continue to bother other (mostly nonUDP) nodes.
>
> I did /sbin/sysctl -w net.ipv4.tcp_syn_retries=3
>
> =================
>
> HOWEVER -- I left two nodes off, while still trying to do backups of
> them, with the above param set to 3
> AND IT DIDN’T HELP AT ALL. THE WHOLE BLASTED BACKUP SET
> ALL FAILED!!! i.e. the other 36 nodes all failed, rather than just a
> few of them.
> as in -- the parameter had more negative effects than it had positive
> effects
>
> Sooooo I believe it requires a code solution, inside the amanda code.
>
> Currently, I run a cron job to check that all my nodes are up, before I
> leave each evening.
> Sometimes they still fail later, and I have trouble. I have no current
> solution. :(
>
> Deb Baddorf
> Fermilab
>
> On May 14, 2015, at 4:38 AM, Gene Heskett <[email protected]> wrote:
>
>> On Wednesday 13 May 2015 13:10:47 Debra S Baddorf wrote:
>>> Several of us have problems where TCP based connections to nodes that
>>> are down (depending on the “auth” method in use) take long enough to
>>> timeout that they cause failures on other nodes which are NOT down.
>>> I have older nodes which are still using udp connections, which
>>> do not ever have this problem. Does this sound like it may be involved
>>> in your case? “auth=bsdtcp” is the default with all the modern
>>> versions of amanda. I have the problem if the node down is
>>> “auth=bsdtcp” or “auth=krb5” but not on the “auth=bsd” nodes.
>>>
>>> Deb Baddorf
>>> Fermilab
>>
>> And I am using bsdtcp, so this fits the observed behaviour to a T. Can
>> this timeout be adjusted, or amanda somehow reconfigured to tolerate it
>> better? Currently:
>>
>> ctimeout = 4 (was 7, but both machines respond much quicker than that)
>>
>> dtimeout = 1200 (this is 20 minutes, but with decent sata drives could
>> shrink even more)
>>
>> etimeout = 300 (I don't recall the mode ATM, may be client estimate, it's
>> quick)
>>
>> I don't see anything else in the manpage that looks applicable.
>>
>> Thanks Deb. With that machine alive last night, it all Just Worked(TM).
>>
>>> On May 13, 2015, at 8:49 AM, Gene Heskett <[email protected]> wrote:
>>>> Greetings all;
>>>>
>>>> Running 3.3.7p1 on assorted client installs.
>>>>
>>>> No clue why, but it appears that one of the machines (alias "shop")
>>>> in the shop crashed last night, sometime after I closed up for the
>>>> evening. I've also lost my ssh -Y link to it, and no ping response.
>>>>
>>>> When amdump ran, it was of course un-available.
>>>>
>>>> But that seems to have killed the backups for this machine too, as
>>>> it was only able to back up the lathe.
>>>>
>>>> amcheck, when I ran it just now, only reported;
>>>> "WARNING: shop: selfcheck request failed: No route to host"
>>>> One problem.
>>>>
>>>> But amdump totally failed this machine, "coyote", too. What sort of
>>>> gremlins might do that?
>>>>
>>>> Cheers, Gene Heskett
>>>> --
>>>> "There are four boxes to be used in defense of liberty:
>>>> soap, ballot, jury, and ammo. Please use in that order."
>>>> -Ed Howdershelt (Author)
>>>> Genes Web page <http://geneslinuxbox.net:6309/gene>
>>
>> Cheers, Gene Heskett
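For what it's worth, the timings quoted in this thread line up with plain exponential backoff on the SYN retransmissions. The sketch below is only arithmetic, not anything taken from amanda or the kernel source; the function names and the `initial_rto` parameter are made up here for illustration. With an initial RTO of 1 second it reproduces the 63 and 127 second figures from ip-sysctl.txt. The measured 9 / 21 / 45 second and 3 min 9 sec (189 second) waits fit the same formula if the initial RTO is taken to be 3 seconds, which is an assumption about the kernel on that server and is not confirmed anywhere in the thread.

```python
def last_retransmission_seconds(syn_retries: int, initial_rto: int = 1) -> int:
    """Time at which the final SYN retransmission is sent.

    Assumes the wait doubles after every unanswered SYN, starting at initial_rto.
    """
    return sum(initial_rto * 2 ** k for k in range(syn_retries))


def syn_timeout_seconds(syn_retries: int, initial_rto: int = 1) -> int:
    """Time at which connect() gives up: one more doubled wait after the last retry."""
    return sum(initial_rto * 2 ** k for k in range(syn_retries + 1))


# Columns: tcp_syn_retries, timeout with RTO=1s, timeout with RTO=3s (assumed).
for retries in (1, 2, 3, 5, 6):
    print(retries, syn_timeout_seconds(retries, 1), syn_timeout_seconds(retries, 3))
```

The 3-second column gives 9, 21, 45 and 189 for tcp_syn_retries of 1, 2, 3 and 5, the same numbers reported above. If that reading is right, the "3 times 63 seconds" observation comes from the initial RTO and not from tcp_retries1.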

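The evening "are all my nodes up" check mentioned above could be sketched roughly like this. It is not amanda code and not the cron job actually in use; it only illustrates bounding the wait on the caller's side instead of relying on tcp_syn_retries. The function names are hypothetical, and port 10080 is assumed to be where the amanda service listens on the clients; adjust if your setup differs.

```python
import socket


def node_is_reachable(host: str, port: int = 10080, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout` seconds.

    The timeout is enforced by the caller, so a dead node costs at most
    `timeout` seconds regardless of the system-wide tcp_syn_retries setting.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, "No route to host", refused connections and DNS failures.
        return False


def split_by_reachability(hosts, port: int = 10080, timeout: float = 5.0):
    """Partition hosts into (up, down) lists, probing each one in turn."""
    up, down = [], []
    for host in hosts:
        (up if node_is_reachable(host, port, timeout) else down).append(host)
    return up, down
```

Calling `split_by_reachability(["shop", "lathe", "coyote"])` shortly before amdump starts would give a list of hosts to flag or leave out of that night's run. It does not help with a node that dies after the check, which is the case that still needs a fix inside amanda itself.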