On Monday, 02.06.2003 at 10:43 +0100, Martin Hepworth wrote:

> >>>Any ideas of why a client would work for a while then randomly not
> >>>be able to do a selfcheck? The other amanda client is still working
> >>>great...
> >>
> >>I have a random problem like this as well running RH Linux. The
> >>client occasionally fails amcheck in the afternoon. (Backups run at
> >>night.) When I look at portland, the client, I find the selfcheck
> >>task "stuck" and I am unable to kill it, even with kill -9. See if
> >>you have the same problem. On the client, try
> >>
> >>ps -ef | grep amand
> >>
> >>or grep with whatever your amanda user account is.
> >>
> >>If you see selfcheck running, you'll be unable to get amcheck on the
> >>server to finish until it's gone. Just something to check.
> >
> >
> >Interesting to see this problem reported - I've had this happen
> >sporadically too. The 'host down' error relates to the localhost and
> >it leaves 'selfcheck' and 'amandad' running in the background. The
> >server is RH Linux 7.3, running AMANDA 2.4.2p2.
> >
> >However, killing those processes does not make everything better.
> >The problem seems unrelated to the AMANDA configuration. The last
> >time it happened here, we were fortunate enough to have a
> >'maintenance window' and rebooted the server and after that amcheck
> >ran without complaint. However, given that this is a production
> >server, rebooting is not a good solution.
>
> Do you use 'localhost' or 'hostname' in the disk list?
>
> I prefer to use 'hostname' for the reason that if you move the amanda
> server to 'someotherhostname' all the tapes etc still reflect the
> correct hosts!
We use 'localhost' ... :-)

> what do the debug logs in /tmp/amanda say when this happens, also
> anything else in /var/log/messages indicating anything odd at this
> time?

Nothing obviously helpful - the only differences between a working and
non-working copy of the selfcheck and amcheck debugs are the timestamps
and process IDs. The amandad debugs show "amandad: dgram_recv: timeout
after 10 seconds" a few times followed by a "amandad: waiting for ack:
timeout, giving up!" message a little later.

As I said, this is a Problem That Goes Away By Itself. I was wondering
if there was some sort of DNS thing going on, but I couldn't get
anywhere with that ...

Dave.
--
Dave Ewart
[EMAIL PROTECTED]
Computing Manager, Epidemiology Unit, Oxford
Cancer Research UK
PGP: CC70 1883 BD92 E665 B840 118B 6E94 2CFD 694D E370
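
The "ps -ef | grep amand" check quoted above can be narrowed a little.
The following is only a sketch, assuming a Linux client with procps
"ps"; the function name flag_amanda_procs is made up for illustration.
It lists any selfcheck or amandad processes and notes whether each is
in uninterruptible sleep (state "D"). A process in that state does not
act on signals, kill -9 included, until the blocking call returns. That
is one possible reason for a selfcheck that cannot be killed, though
nothing in this thread confirms it is the cause here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Amanda): reads lines of
# "PID STAT COMMAND" on stdin and prints only the Amanda client
# processes, noting whether each is in uninterruptible sleep.
flag_amanda_procs() {
    awk '$3 ~ /^(selfcheck|amandad)$/ {
        # A STAT value beginning with "D" means uninterruptible sleep.
        note = ($2 ~ /^D/) ? "uninterruptible" : "signalable"
        print $1, $3, $2, note
    }'
}

# Feed it the live process table (procps output format assumed).
ps -eo pid=,stat=,comm= | flag_amanda_procs
```

No output means neither process is currently running on the client. A
line ending in "uninterruptible" points at something the process is
blocked on (a hung mount, for instance) rather than at the AMANDA
configuration.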
