Re: problems with some failing backups

Niall O Broin Sun, 05 May 2002 06:25:19 -0700

On Sat, May 04, 2002 at 08:02:58PM -0400, Michael Richardson wrote:

>   I back up 7 local systems with Amanda.
> 
>   Three Linux boxes (1 Debian/i386, 1 RH/i386, 1 RH/Netwinder), and 
> four NetBSD/i386 boxes. There is a NetBSD/ipf firewall between the backup
> server (NetBSD/i386) and some of the boxes. Some of the backups also occur
> over IPsec (yes, even though they are all "local").
> 
>   Two boxes on the same wire as backup server (plus the server itself) 
> work flawlessly. The IPsec connected ones work fine.
> 
>   The three behind the firewall fail frequently, but not 100% of the time.
> I set up backups for just those hosts and watched with tcpdump. I've built
> with the appropriate port ranges, and I have never seen firewall failures,
> yet I still get failures.


Speak to me, brother! I've been posting about a similar problem here but
I've had no responses. Do you get messages like these in the report:

  serv1      /boot lev 0 FAILED [Request to serv1 timed out.]
  serv1      / lev 0 FAILED [Request to serv1 timed out.]
  
My remote backups (that is, those of the machines on the other side of the
firewall) fail nearly all the time. My boxes are all Linux with large / and
small /boot partitions. Sometimes L0 backups of /boot work, and once or
twice an L0 of / worked (for one client), but generally the only thing that
works is an L1 of /boot, which is of course tiny.
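
To see which disk and level combinations fail most often, the FAILED lines
from the mail reports can be tallied. The following is only a rough Python
sketch: the regular expression is fitted to the two report lines quoted
above, and the names FAILED_LINE and tally_failures are made up for
illustration, not part of Amanda.

```python
import re
from collections import Counter

# Fitted to report lines such as:
#   serv1      /boot lev 0 FAILED [Request to serv1 timed out.]
# The pattern is an assumption based on those lines, not an Amanda spec.
FAILED_LINE = re.compile(
    r"^\s*(?P<host>\S+)\s+(?P<disk>\S+)\s+lev\s+(?P<level>\d+)"
    r"\s+FAILED\s+\[(?P<reason>.*)\]\s*$"
)


def tally_failures(report_text):
    """Count FAILED entries per (host, disk, level) in a mail report."""
    counts = Counter()
    for line in report_text.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            key = (match["host"], match["disk"], int(match["level"]))
            counts[key] += 1
    return counts


sample = """\
  serv1      /boot lev 0 FAILED [Request to serv1 timed out.]
  serv1      / lev 0 FAILED [Request to serv1 timed out.]
"""

for (host, disk, level), n in sorted(tally_failures(sample).items()):
    print(host, disk, "L%d" % level, n)
```

Feeding it several days of reports concatenated together should show
whether the failures cluster on the large level 0 dumps.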

>   Coincidentally, the machines that fail are all less than 300MHz systems,
> (233MHz, 350MHz, 200MHz), while the machines that work are 650MHz+. The
> backup server itself, however is a K5-133 running NetBSD/i386, and a lot of
> SCSI spindles. (Yeah, it needs to be replaced)

I've a different situation - my failing machines are 2 x 1.2 GHz and 1 x
250 MHz. However, my firewall is quite a slow box - I can't reach it now to
say exactly how slow. I suspect that the firewall can't handle the load,
although I have clients using NFS to access servers through it. NFS as a
protocol is good at error recovery, though, so that probably explains why it
copes.

>   My impression is that the failures are because the backup time estimates 
> take too long and the backup server gives up on them. On the clients, I
> don't see any errors in the /tmp/amanda output - it looks normal to me.

At the end of amandad.debug on a failing client I see

amandad: sending REP packet:
----
Amanda 2.4 REP HANDLE 002-F8B30708 SEQ 1020382783
OPTIONS maxdumps=1;
/ 0 SIZE 6929200
/boot 0 SIZE 3600
----

amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, giving up!

which is presumably related to the timeout in the mail reports.
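
The pattern in that log (send a datagram, wait 10 seconds for an ACK, retry,
give up after the fifth timeout) can be modelled roughly as below. This is a
Python sketch of the behaviour the log suggests, not Amanda's actual code;
send_with_ack and its parameters are hypothetical names, with the defaults
taken from the 10-second timeout and five attempts seen above.

```python
import socket


def send_with_ack(sock, payload, addr, timeout=10.0, attempts=5):
    """Send payload to addr over UDP and wait for any reply.

    Retries on timeout. Returns (reply, tries); reply is None when every
    attempt timed out.
    """
    sock.settimeout(timeout)
    for tries in range(1, attempts + 1):
        sock.sendto(payload, addr)
        try:
            reply, _sender = sock.recvfrom(65535)
            return reply, tries
        except socket.timeout:
            # Corresponds to "waiting for ack: timeout, retrying";
            # falling out of the loop corresponds to "giving up!".
            continue
    return None, attempts
```

With those defaults the sender waits at most 5 x 10 = 50 seconds in total,
so if the ACK is dropped or delayed on every attempt, the request is
abandoned after that.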

>   I've been through the documentation and the FAQs, and I've watched
> tcpdump's of the traffic going through... nothing obvious.

Like you, I've RTFM and STFW but to no avail. I haven't got to the
tcpdump stage yet, mind you.


Kindest regards,


Niall  O Broin
