> Can this timeout be adjusted, or amanda somehow reconfigured to tolerate
> it better?

Basically: NO. I've tried all the amanda parameters I can find. The
timeout is external to amanda: it's built into the TCP system and its
values. Changing it would require recoding inside amanda, and nobody has
had the time (and knowledge) to do it. I've thought about it ... it's on
a deep back burner. You would have to send a connect request with
"no-wait", and then send another call later to actually decide whether
the node is there.
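For the curious, that "no-wait" idea looks roughly like this (a minimal
Python sketch, NOT amanda code; the host name and the 10-second budget
are invented for illustration, and 10080 is amanda's registered service
port):

    # Sketch of the "no-wait" idea: start a non-blocking TCP connect,
    # go do other work, then check back with a deadline of OUR choosing
    # instead of waiting out the kernel's SYN retransmission schedule.
    import errno
    import select
    import socket

    def start_connect(host, port):
        """Fire off the connect request without waiting for the handshake."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        rc = s.connect_ex((host, port))   # returns immediately
        if rc not in (0, errno.EINPROGRESS):
            s.close()
            raise OSError(rc, "connect failed immediately")
        return s

    def check_connect(s, wait_seconds):
        """Later: decide whether the node is actually there."""
        _, writable, _ = select.select([], [s], [], wait_seconds)
        if not writable:
            return False   # no SYN-ACK yet: treat the node as down/slow
        err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        return err == 0    # 0 means the handshake completed

    s = start_connect("shop.example.com", 10080)
    # ... a dumper could be starting connects to other nodes here ...
    print("up" if check_connect(s, 10.0) else "down or unreachable")

The point is that the waiting becomes the caller's decision; a dead node
would cost ten seconds of one dumper's time instead of stalling
everything.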
Some stuff I tried (besides all the amanda parameters I could find),
from https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
These are all external TCP values, not amanda values:

tcp_syn_retries - INTEGER   (mine is 5)      <== BINGO: THIS WORKED
    Number of times initial SYNs for an active TCP connection attempt
    will be retransmitted. Should not be higher than 255. Default value
    is 6, which corresponds to 63 seconds till the last retransmission
    with the current initial RTO of 1 second. With this the final
    timeout for an active TCP connection attempt will happen after
    127 seconds.

tcp_retries1 - INTEGER   (mine is 3)
    This value influences the time after which TCP decides that
    something is wrong due to unacknowledged RTO retransmissions, and
    reports this suspicion to the network layer. See tcp_retries2 for
    more details. RFC 1122 recommends at least 3 retransmissions, which
    is the default.

Oooo -- I'm seeing 3 minutes 9 seconds ... which is about 3
(tcp_retries1) times the 63 seconds mentioned under tcp_syn_retries
(except that mine is set to 5, not 6). Hmm. Per my side-column comment
above ("BINGO"), I actually changed this value at the system level and
saw a change in the length of time that amanda had to wait to know that
a node was down (as noticed by "amcheck").

SO I tried some values of tcp_syn_retries for my whole server node
(which affects ALL of its TCP, mind you ... I could risk that, since my
node only does backups). The parameter defaults to 5, at which it takes
3 minutes 9 seconds to learn that a node is not on the network:

    at 1, it only took  9 seconds
    at 2, it only took 21 seconds
    at 3, it took      45 seconds

I think I'll try this value (3) for a bit and see if missing nodes
continue to bother the other (mostly non-UDP) nodes. I did:

    /sbin/sysctl -w net.ipv4.tcp_syn_retries=3
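If anybody wants the arithmetic: per the kernel doc above, the initial
RTO is 1 second and it doubles on every SYN retransmission, so a single
connect attempt gives up after 2^(tcp_syn_retries + 1) - 1 seconds. The
extra factor of 3 is my guess (tcp_retries1?), an assumption on my part,
but it reproduces every number I measured:

    # Reproduce the measured waits: 1s initial RTO, doubled after each
    # SYN retransmission. The "times 3" multiplier is an assumption
    # (tcp_retries1 = 3?), not anything amanda or the kernel doc states.
    def syn_timeout(syn_retries, initial_rto=1):
        """Seconds until one connect() attempt gives up."""
        # SYN at t=0, retries at t=1, 3, 7, ... => 2**(n+1) - 1 total.
        return initial_rto * (2 ** (syn_retries + 1) - 1)

    for r in (1, 2, 3, 5, 6):
        t = syn_timeout(r)
        print(f"tcp_syn_retries={r}: {t:3d}s per attempt, {3 * t}s with x3")
    # 1:   3s per attempt,   9s  <- measured  9 seconds
    # 2:   7s per attempt,  21s  <- measured 21 seconds
    # 3:  15s per attempt,  45s  <- measured 45 seconds
    # 5:  63s per attempt, 189s  <- measured 3 min 9 s
    # 6: 127s per attempt        (the documented kernel default)

(If you want the sysctl change to survive a reboot, the usual place is
/etc/sysctl.conf, though read on before you bother.)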
=================

HOWEVER -- I left two nodes off, while still trying to do backups of
them, with the above parameter set to 3, AND IT DIDN'T HELP AT ALL. THE
WHOLE BLASTED BACKUP SET FAILED!!! I.e. the other 36 nodes all failed,
rather than just a few of them. As in: the parameter had more negative
effects than it had positive effects.

Sooooo ... I believe it requires a code solution, inside the amanda
code.

Currently, I run a cron job to check that all my nodes are up before I
leave each evening (a rough sketch of that kind of check is at the
bottom of this message). Sometimes they still fail later, and I have
trouble. I have no current solution. :(

Deb Baddorf
Fermilab

On May 14, 2015, at 4:38 AM, Gene Heskett <[email protected]> wrote:

> On Wednesday 13 May 2015 13:10:47 Debra S Baddorf wrote:
>> Several of us have problems where TCP-based connections to nodes that
>> are down (depending on the "auth" method in use) take long enough to
>> time out that they cause failures on other nodes which are NOT down.
>> I have older nodes which are still using UDP connections, which do
>> not ever have this problem. Does this sound like it may be involved
>> in your case? "auth=bsdtcp" is the default with all the modern
>> versions of amanda. I have the problem if the node that is down is
>> "auth=bsdtcp" or "auth=krb5", but not on the "auth=bsd" nodes.
>>
>> Deb Baddorf
>> Fermilab
>
> And I am using bsdtcp, so this fits the observed behaviour to a T. Can
> this timeout be adjusted, or amanda somehow reconfigured to tolerate
> it better? Currently:
>
> ctimeout = 4 (was 7, but both machines respond much quicker than that)
>
> dtimeout = 1200 (this is 40 minutes, but with decent sata drives could
> shrink even more)
>
> etimeout = 300 (I don't recall the mode ATM, may be client estimate,
> it's quick)
>
> I don't see anything else in the manpage that looks applicable.
>
> Thanks Deb. With that machine alive last night, it all Just Worked(TM).
>
>> On May 13, 2015, at 8:49 AM, Gene Heskett <[email protected]> wrote:
>>> Greetings all;
>>>
>>> Running 3.3.7p1 on assorted client installs.
>>>
>>> No clue why, but it appears that one of the machines (alias "shop")
>>> in the shop crashed last night, sometime after I closed up for the
>>> evening. I've also lost my ssh -Y link to it, and no ping response.
>>>
>>> When amdump ran, it was of course unavailable.
>>>
>>> But that seems to have killed the backups for this machine too, as
>>> it was only able to back up the lathe.
>>>
>>> amcheck, when I ran it just now, only reported:
>>> "WARNING: shop: selfcheck request failed: No route to host"
>>> One problem.
>>>
>>> But amdump totally failed this machine, "coyote", too. What sort of
>>> gremlins might do that?
>>>
>>> Cheers, Gene Heskett
>>> --
>>> "There are four boxes to be used in defense of liberty:
>>> soap, ballot, jury, and ammo. Please use in that order."
>>> -Ed Howdershelt (Author)
>>> Genes Web page <http://geneslinuxbox.net:6309/gene>
>
> Cheers, Gene Heskett
> --
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> Genes Web page <http://geneslinuxbox.net:6309/gene>
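P.S. The cron-job check mentioned above amounts to something like the
sketch below. It is hypothetical, not my actual script: fill in your own
disklist hosts, and note that 10080 is the amanda client service port.
The short per-host timeout is the whole trick; it means one dead node is
reported in seconds rather than minutes.

    # Pre-flight check: try each client with a short TCP timeout so a
    # dead node is found quickly, before amdump trips over it tonight.
    import socket

    NODES = ["shop", "lathe", "coyote"]   # hypothetical host list
    PORT = 10080                          # amanda client service port

    down = []
    for host in NODES:
        try:
            # A plain blocking connect, but with OUR 5-second timeout
            # instead of the kernel's SYN retransmission schedule.
            with socket.create_connection((host, PORT), timeout=5.0):
                pass
        except OSError:
            down.append(host)

    if down:
        print("DOWN before tonight's amdump:", ", ".join(down))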
