On one of my older setups, which is running Amanda 2.5.1p3, I'm getting patterns of failures that I can't make sense of. Some days everything works just fine. Other days anywhere from a couple of DLEs to half a dozen fail. The failures are typically on the same three servers, which are in another building. However, there is a server in that same building, the one with the largest DLEs, that does not exhibit failures. There are also several servers in the same building as the Amanda backup server that don't show any failures. I even made a spreadsheet that shows a 0, 1 or 2 for the backup level of each success and a red x for each failure. I can't see any pattern.

I've looked at a bunch of things and pored over the log files to no avail.

The errors show up in the Amanda reports as:

  anise.nsm.umass.edu        /home     lev 1  FAILED [cannot read header: got 0 instead of 32768]
  metzi1.physics.umass.edu   /data     lev 1  FAILED [cannot read header: got 0 instead of 32768]
  anise.nsm.umass.edu        /home     lev 1  FAILED [cannot read header: got 0 instead of 32768]
  anise.nsm.umass.edu        /home     lev 1  FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]"]
  metzi1.physics.umass.edu   /data     lev 1  FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]"]
  metzi1.physics.umass.edu   /data     lev 1  FAILED [cannot read header: got 0 instead of 32768]
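
To look for a pattern across several nights, the FAILED lines from the reports can be tallied by host, DLE and reason. What follows is only a rough Python sketch, not anything that ships with Amanda: the function name `tally_failures` and the regular expressions are assumptions based solely on the line layout shown above, and the code re-joins lines that the mailer wrapped.

```python
import re
from collections import Counter

# Assumed layout, taken from the report excerpt: host, disk, "lev N",
# the word FAILED, then the reason in square brackets.
ENTRY_START_RE = re.compile(r"^\s*\S+\s+\S+\s+lev\s+\d+\s")
FAILED_RE = re.compile(
    r"^\s*(?P<host>\S+)\s+(?P<disk>\S+)\s+lev\s+(?P<level>\d+)"
    r"\s+FAILED\s+\[(?P<reason>.*)\]\s*$"
)


def tally_failures(report_text):
    """Count FAILED entries per (host, disk, reason)."""
    entries = []
    for line in report_text.splitlines():
        if not line.strip():
            continue
        if ENTRY_START_RE.match(line):
            # A new "host disk lev N ..." entry begins here.
            entries.append(line.rstrip())
        elif entries:
            # Otherwise treat the line as the wrapped tail of the previous entry.
            entries[-1] += " " + line.strip()

    counts = Counter()
    for entry in entries:
        match = FAILED_RE.match(entry)
        if match:
            # Collapse runs of whitespace so wrapped and unwrapped lines compare equal.
            reason = re.sub(r"\s+", " ", match.group("reason"))
            counts[(match.group("host"), match.group("disk"), reason)] += 1
    return counts
```

Feeding it the text of each nightly report and printing `counts.most_common()` would show which host, DLE and error combinations recur most often.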

The interesting thing is that if I go to metzi1, into /tmp/amanda/client/daily/ and do an `ls *0508*` I can see the debug logs from last night. If I do a `grep '/data' *0508*` I can see any entries that mention the DLE /data. I see no instances of sendbackup. I see only those runtar debug files that correspond to the size estimates for 0, 1 and 2 level backups. There is no runtar that would correspond to an actual dump.

If I do the same thing with `grep '/home' *0508*` (the DLE /home was successfully backed up), then I see all the runtar debug files for the estimates as well as a runtar debug file for the actual backup. I also see several lines in the sendbackup debug file for /home.
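
The two grep checks above can be repeated for every DLE and every night with a small script. This is a minimal Python sketch under stated assumptions: `programs_mentioning` is a hypothetical helper name, and it assumes that the part of each debug file name before the first dot is the program name (runtar, sendbackup and so on), which is a guess rather than something confirmed here.

```python
import glob
import os
from collections import defaultdict


def programs_mentioning(debug_dir, date_fragment, dle):
    """Map program name -> debug files from that date whose text mentions dle."""
    hits = defaultdict(list)
    # Same selection as `ls *0508*`: any file whose name contains the fragment.
    pattern = os.path.join(debug_dir, "*" + date_fragment + "*")
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            # Plain substring match, like grep with a fixed string.
            if dle in handle.read():
                name = os.path.basename(path)
                # Assumption: the program name is the text before the first dot.
                program = name.split(".", 1)[0]
                hits[program].append(name)
    return dict(hits)
```

For example, `programs_mentioning("/tmp/amanda/client/daily", "0508", "/data")` returning a `runtar` key but no `sendbackup` key would correspond to the situation described for /data on metzi1. As with grep, a substring match on `/data` would also match a longer path that begins the same way.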

I've also looked through /var/log/syslog, /var/log/auth.log, etc. on the client (which is Ubuntu 12.04 LTS), and I've looked through Amanda debug logs, /var/adm/messages, /var/adm/authlog, etc. on the server (which is Solaris 10). I don't see any logged errors for dropped connections or failures of any sort. The Amanda logs just don't mention /data on metzi1. A couple of the other servers that are being backed up are Ubuntu 12.04 and several are Ubuntu 10.04. None of them have had sshd_config tweaked. All have TCPKeepAlive turned on.

I tried bumping up the timeouts in amanda.conf (by a factor of 5). That seems a bit much, and it didn't seem to make any difference.

What should I be looking for? Where would Amanda log what is going on? (Or, why would it not be logging it?)


Thank you,


--
---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

<hoogen...@bio.umass.edu>

---------------

Erdös 4
