Debian lenny
Amanda 2.5.2p1 (the latest packaged for lenny)
The amanda server is running on the LAN, backing up the 3 hosts on the DMZ and 
whatever's up on the LAN.

In the past couple of days, two of my hosts, server.slsware.dmz (SSD below) and 
ntp.stsware.dmz (NSD below), have stopped returning estimates for some of their 
DLEs. SSD is a Dell server with a SCSI system disk and a pair of SATA drives in 
RAID1 for bulk data. NSD is a little VIA box serving NTP. 

The email reports are full of lines like:

>  server.slsware.dmz     /var/www                lev 0  FAILED [disk /var/www, all estimate timed out]
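
To see at a glance which DLEs are affected, the timed-out entries could be pulled out of a saved copy of the report. This is only a sketch: the pattern is modeled on the single report line quoted above, the function name `timed_out_dles` is made up for illustration, and other Amanda versions may format these lines differently.

```python
import re

# Assumption: failure lines look like the one quoted above
# ("host  disk  lev N  FAILED [... all estimate timed out]").
TIMED_OUT = re.compile(
    r"(?P<host>\S+)\s+(?P<disk>\S+)\s+lev\s+(?P<level>\d+)\s+FAILED\s+"
    r"\[[^\]]*all estimate timed out\]"
)


def timed_out_dles(report_text):
    """Return (host, disk, level) for every 'all estimate timed out' entry."""
    cleaned = []
    for line in report_text.splitlines():
        # Drop mail quoting ("> ") so wrapped, quoted lines can be rejoined.
        cleaned.append(re.sub(r"^\s*>\s?", "", line).strip())
    # Join everything with single spaces so an entry wrapped across
    # two lines still matches the pattern.
    joined = " ".join(part for part in cleaned if part)
    return [
        (m.group("host"), m.group("disk"), int(m.group("level")))
        for m in TIMED_OUT.finditer(joined)
    ]
```

Fed the line quoted above, this should yield one entry for server.slsware.dmz and /var/www at level 0.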


Amcheck says all is well (valid tape and no problems with clients). I assume 
that means that the amanda server can get to amandad on all the clients, so 
there's no problem with the PIX firewall between the LAN and the DMZ or with 
the packet filters on any of the hosts.

When a backup runs, NSD returns no estimate at all (it has only one DLE) and 
SSD returns none for most, but not all, of its DLEs. 

When a backup starts, the CPU usage goes up on both machines and top says tar 
is the most active program. But after a little bit, tar is no longer running 
and the CPU usage is back to normal. And a bit later, all the other hosts have 
returned all their estimates. But SSD and NSD never do.
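
One way to check whether the estimates are simply running out of time would be to time a tar pass over a DLE by hand on the client. The sketch below rests on several assumptions: GNU tar is on the PATH, Amanda's tar-based estimate behaves roughly like a tar run with the archive sent to /dev/null, and 300 seconds is the limit to compare against (that is believed to be the default `etimeout`, but the real value should be read from amanda.conf). The name `time_tar_pass` is hypothetical.

```python
import subprocess
import time


def time_tar_pass(path, timeout_s=300):
    """Time a tar run over `path` with the archive discarded.

    Returns (elapsed_seconds, finished). `finished` is False when the
    run was killed for exceeding `timeout_s`.
    """
    # Assumption: this only approximates what an Amanda estimate does;
    # the exact command Amanda runs is recorded in its client debug logs.
    cmd = ["tar", "--create", "--file", "/dev/null",
           "--one-file-system", path]
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,  # tar warnings should not abort the timing
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished
```

Running it as root on SSD against /var/www and comparing the elapsed time with the configured estimate timeout would show whether tar itself is slow or whether the result is getting lost on the way back to the server.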

I don't think it has anything to do with the DMZ because one host does fine and 
one DLE gets through from SSD. 

The lines in inetd.conf are identical. 

One thing the two have in common is that monit says their CPU wait goes high 
from time to time (I understand that means some interrupts are slow, but I don't 
know why or what the consequences are). Also, on SSD only one of the SATA 
drives shows up in webmin's SMART monitoring, and that one says monitoring 
isn't enabled (it is enabled in /etc/default). SMART monitoring is fine on NSD.

Diff claims the amandad binaries are identical on a working host and on SSD.

scp -r ssd:/var/www . copies the files with no sluggishness or latency.

Another possibly significant symptom is that it's the first DLE (the 150MB 
/boot partition) that seems to get an estimate from SSD. OTOH, the first DLE on 
NSD (/) does not.

This configuration (available on request) has been working for 2 or 3 years. 
I'm pretty certain I didn't change anything in there.

Any ideas on what may be wrong??

-- 
Glenn English
[email protected]


