On 10/17/2014 05:24 PM, Debra S Baddorf wrote:
Amanda experts:
I’m trying to follow up on my woes wherein a TCP-flavored client (auth KRB5 in
this case) being offline
causes other backups to fail. I just noticed that all the failing nodes are
(a) UDP type nodes, i.e. auth=bsd
(b) failing in the estimate phase
Why would one node affect the others?
Because the connect system call hangs for a long time, and all the other
clients exceed their timeout.
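
One way to read the explanation above is as simple clock arithmetic: if a single process blocks in a connect to a dead host, it cannot notice replies from the healthy hosts until the connect gives up, and by then their estimate deadlines have passed. The sketch below is only a toy model of that reading, not Amanda's code; the function name, the parameters and all the numbers are hypothetical.

```python
def simulate(hosts, estimate_timeout, connect_hang, estimate_time):
    """Toy model: one process, one clock, blocking connects.

    hosts            -- list of (name, is_down) pairs
    estimate_timeout -- seconds a host may take before it is failed
    connect_hang     -- seconds a connect to a down host blocks
    estimate_time    -- seconds a healthy host needs to answer
    All values are made up for illustration.
    """
    clock = 0.0
    started = {}
    results = {}

    # Requests to reachable hosts go out first; their timers start now.
    for name, is_down in hosts:
        if not is_down:
            started[name] = clock

    # Each connect to a down host blocks the whole process.
    for name, is_down in hosts:
        if is_down:
            clock += connect_hang
            results[name] = "connection timed out"

    # Replies are only noticed once the process is running again.
    # In this model a reply noticed after the deadline counts as a timeout.
    for name, t0 in started.items():
        noticed_at = max(clock, t0 + estimate_time)
        if noticed_at - t0 <= estimate_timeout:
            results[name] = "ok"
        else:
            results[name] = "estimate timeout"
    return results


if __name__ == "__main__":
    print(simulate([("ACSY", True), ("ADES", False)], 300, 600, 30))
    print(simulate([("ACSY", False), ("ADES", False)], 300, 600, 30))
```

With the down host present, the healthy host is reported as an estimate timeout even though it answered quickly; with every host up, nothing times out.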
PS Why, when node ACSY (names changed to protect the innocent) failed to
initially connect,
does amanda still try to re-connect and do estimates? Wouldn’t an initial
failure (during the KRB5
privilege negotiation) cancel the whole process for that node?
Perhaps I should set hostp->up = HOST_DONE after the first connection
failure? Then
it wouldn’t affect other nodes at a later stage in the process. Right?
That’d be an easy code
insertion (I think).
You can try it, but only do it for the 'Connection timed out' error.
Jean-Louis
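
The suggestion in this exchange (mark the host done after the first failure, but only for 'Connection timed out') can be sketched as below. This is a standalone Python model, not a patch to Amanda's C source: the `Host` class, the `READY` state and both function names are hypothetical, and only `HOST_DONE` and the error strings come from the thread.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    READY = "ready"  # hypothetical name for "may still be contacted"
    DONE = "done"    # stands in for HOST_DONE from the thread


@dataclass
class Host:
    name: str
    up: HostState = HostState.READY


def handle_request_failure(host, errstr):
    """Mark the host done only when the connection itself timed out.

    Other failures leave the state alone, so the host can still be
    retried later. Returns True if the host was marked done.
    """
    if "Connection timed out" in errstr:
        host.up = HostState.DONE
        return True
    return False


def hosts_to_retry(hosts):
    """Names of hosts that later phases may still contact."""
    return [h.name for h in hosts if h.up is not HostState.DONE]


if __name__ == "__main__":
    acsy, chab = Host("ACSY"), Host("CHAB")
    handle_request_failure(acsy, "Request to ACSY failed: Connection timed out")
    handle_request_failure(chab, "request failed: timeout waiting for REP")
    print(hosts_to_retry([acsy, chab]))
```

Matching on the error text keeps the change narrow: a host that was reachable but slow is not written off along with one that never answered the connect.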
Deb Baddorf
Fermilab
=========== (sorry, debug logs are gone by now. I could re-create them though
by doing it again) ================
These dumps were to tape ad-LTO2-daily-117.
The next 4 tapes Amanda expects to use are: ad-LTO2-daily-118,
ad-LTO2-daily-119, ad-LTO2-daily-120, ad-LTO2-daily-121.
FAILURE DUMP SUMMARY:
planner: ERROR Request to ACSY failed: Connection timed out   <<<<<<<<< THIS NODE IS DOWN
CHAB WWWdata lev 0 FAILED [too many dumper retry: [request failed: timeout waiting for REP]]
ACSY / lev 0 FAILED [Request to ACSY failed: Connection timed out]   <<<<<<<< YET IT STILL TRIES IT, AND BOTHERS OTHERS
ACSY /boot lev 0 FAILED [Request to ACSY failed: Connection timed out]   <<<<<<<< IN THIS ESTIMATE PHASE
ACSY /data lev 0 FAILED [Request to ACSY failed: Connection timed out]   <<<<<<<<
ADES / lev 0 FAILED [Some estimate timeout on ADES]
ADES /home lev 0 FAILED [Some estimate timeout on ADES]
ADES /opt lev 0 FAILED [Some estimate timeout on ADES]
ADES /usr lev 0 FAILED [Some estimate timeout on ADES]
ADES /var lev 0 FAILED [Some estimate timeout on ADES]
ADES /boot lev 0 FAILED [Some estimate timeout on ADES]
LINA / lev 0 FAILED [Some estimate timeout on LINA]
LINA /var lev 0 FAILED [Some estimate timeout on LINA]
LINA /usr lev 0 FAILED [Some estimate timeout on LINA]
LINA /data lev 0 FAILED [Some estimate timeout on LINA]
ANIM / lev 0 FAILED [Some estimate timeout on ANIM]
ANIM /var lev 0 FAILED [Some estimate timeout on ANIM]
ANIM /usr/local/www lev 0 FAILED [Some estimate timeout on ANIM]
ANIM /usr/home lev 0 FAILED [Some estimate timeout on ANIM]
ANIM /usr/local/apache-tomcat-7.0 lev 0 FAILED [Some estimate timeout on ANIM]
BINA / lev 0 FAILED [Some estimate timeout on BINA]
BINA /var lev 0 FAILED [Some estimate timeout on BINA]
GRAV / lev 0 FAILED [Some estimate timeout on GRAV]
GRAV /var lev 0 FAILED [Some estimate timeout on GRAV]
GUMB / lev 0 FAILED [Some estimate timeout on GUMB]
GUMB /esh lev 0 FAILED [Some estimate timeout on GUMB]
GUMB /home lev 0 FAILED [Some estimate timeout on GUMB]
GUMB /opt lev 0 FAILED [Some estimate timeout on GUMB]
GUMB /usr lev 0 FAILED [Some estimate timeout on GUMB]
GUMB /var lev 0 FAILED [Some estimate timeout on GUMB]
PROT / lev 0 FAILED [Some estimate timeout on PROT]
PROT /var lev 0 FAILED [Some estimate timeout on PROT]
QUAS / lev 0 FAILED [Some estimate timeout on QUAS]
QUAS /var lev 0 FAILED [Some estimate timeout on QUAS]
CHAB WWWdata lev 0 FAILED [cannot read header: got 0 bytes instead of 32768]
CHAB WWWdata lev 0 FAILED [cannot read header: got 0 bytes instead of 32768]