Hello all. My backup setup has been working fine for the last months. Now I'm getting systematic FAILED dumps on the local filesystem. What changed? There is a user process that is running a scientific simulation program which uses about 500M of virtual memory footprint (the machine has 520M RAM and 1G swap). This may be causing timeouts, and I could increase the timeouts value, but from an inspection to the debug files, I'm not sure if it is just a timeout issue.
Therefore, I'm sending this mail to ask for help on figuring out what is wrong, from the log files. Ok, let's start with the amanda mail report: ---------------------------------------------------------------------- FAILURE AND STRANGE DUMP SUMMARY: omni //new/E$ lev 0 FAILED [Estimate timeout from omni] omni //new/C$ lev 0 FAILED [Estimate timeout from omni] omni /var/spool/imap/user/uz lev 0 FAILED [Estimate timeout from omni] omni /var/spool/imap/user/nt lev 0 FAILED [Estimate timeout from omni] [...] ---------------------------------------------------------------------- (we can forget for now the smb backups, and restrict our attention to the local filesystems) The following file is amandad.20050904195400000.debug, which I assume is amandad daemon running on localhost, attending requests to sendsize/backups: ---------------------------------------------------------------------- [...] amandad: time 0.000: got packet: -------- Amanda 2.4 REQ HANDLE 000-789C0608 SEQ 1125860041 SECURITY USER amanda SERVICE sendsize OPTIONS features=fffffeff9ffe0f;maxdumps=1;hostname=omni; GNUTAR //new/E$ 0 1970:1:1:0:0:0 -1 OPTIONS |;bsd-auth;compress-fast;index; GNUTAR //new/E$ 1 2005:8:30:3:5:35 -1 OPTIONS |;bsd-auth;compress-fast;index; GNUTAR //new/E$ 2 2005:8:31:3:50:7 -1 OPTIONS |;bsd-auth;compress-fast;index; GNUTAR //new/C$ 0 1970:1:1:0:0:0 -1 OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys; GNUTAR //new/C$ 1 2005:8:28:3:18:35 -1 OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys; GNUTAR //new/C$ 2 2005:8:31:3:53:48 -1 OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./pagefile.sys; GNUTAR /var/spool/imap/user/uz /var/spool/imap/user 0 1970:1:1:0:0:0 -1 OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./[a-t]*; GNUTAR /var/spool/imap/user/uz /var/spool/imap/user 1 2005:8:29:3:20:17 -1 OPTIONS |;bsd-auth;compress-fast;index;exclude-file=./[a-t]*; [...] -------- amandad: time 0.000: sending ack: ---- Amanda 2.4 ACK HANDLE 000-789C0608 SEQ 1125860041 ---- amandad: time 0.001: bsd security: remote host omni.isr.ist.utl.pt user amanda local user amanda amandad: time 0.001: amandahosts security check passed amandad: time 0.001: running service "/usr/local/amanda/libexec/sendsize" amandad: time 12571.560: sending REP packet: ---- Amanda 2.4 REP HANDLE 000-789C0608 SEQ 1125860041 OPTIONS features=fffffeff9ffe0f; / 0 SIZE 4974600 / 1 SIZE 678600 /boot 0 SIZE 6750 /boot 1 SIZE 10 /usr 0 SIZE 3312330 /usr 1 SIZE 174220 [...] ---- amandad: time 12581.571: dgram_recv: timeout after 10 seconds amandad: time 12581.571: waiting for ack: timeout, retrying amandad: time 12591.572: dgram_recv: timeout after 10 seconds amandad: time 12591.572: waiting for ack: timeout, retrying amandad: time 12601.572: dgram_recv: timeout after 10 seconds amandad: time 12601.572: waiting for ack: timeout, retrying amandad: time 12611.572: dgram_recv: timeout after 10 seconds amandad: time 12611.572: waiting for ack: timeout, retrying amandad: time 12621.572: dgram_recv: timeout after 10 seconds amandad: time 12621.572: waiting for ack: timeout, giving up! amandad: time 12621.602: pid 2657 finish time Sun Sep 4 23:24:21 2005 ---------------------------------------------------------------------- Now, let's check sendsize.20050904195400.debug: ---------------------------------------------------------------------- sendsize: debug 1 pid 2659 ruid 92 euid 92: start at Sun Sep 4 19:54:00 2005 sendsize: version 2.4.4p4 sendsize[2659]: time 0.232: waiting for any estimate child: 1 running sendsize[2664]: time 0.232: calculating for amname '/', dirname '/', spindle -1 sendsize[2664]: time 0.232: getting size via gnutar for / level 0 sendsize[2664]: time 0.513: spawning /usr/local/amanda/libexec/runtar in pipeline sendsize[2664]: argument list: /bin/tar --create --file /dev/null --directory / --one-file-system --listed-incremental /usr/local/amanda/var/amanda/gnutar-lists/omni__0.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._.20050904195400.exclude . sendsize[2664]: time 91.573: /bin/tar: ./var/amavis/tmp/spamassassin.149.wxdvgp.tmp: Warning: Cannot stat: No such file or directory sendsize[2664]: time 108.878: /bin/tar: ./var/spool/mqueue-mta/local/tfj84IdMD1018683: Warning: Cannot stat: No such file or directory [...] sendsize[1498]: estimate time for //new/E$ level 2: 0.444 sendsize[1498]: estimate size for //new/E$ level 2: 872525 KB sendsize[1498]: time 12571.433: waiting for /usr/bin/smbclient "//new/E$" child sendsize[1498]: time 12571.433: after /usr/bin/smbclient "//new/E$" wait sendsize[1498]: time 12571.433: done with amname '//new/E$', dirname '//new/E$', spindle -1 sendsize[2659]: time 12571.433: child 1498 terminated normally sendsize: time 12571.438: pid 2659 finish time Sun Sep 4 23:23:31 2005 ---------------------------------------------------------------------- Conclusions: sendsize finishes after 12571.438 secs, while amandad starts complaining about timeouts at time 12581.571. The timeouts could be caused by the abnormal CPU/swap load, but these timeouts are not visible in the logs. What do you suggest me to do next to trackdown the problem? Cheers, Rodrigo Ventura -- *** Rodrigo Martins de Matos Ventura <[EMAIL PROTECTED]> *** Web page: http://www.isr.ist.utl.pt/~yoda *** Teaching Assistant and PhD Student at ISR: *** Instituto de Sistemas e Robotica, Polo de Lisboa *** Instituto Superior Tecnico, Lisboa, PORTUGAL *** PGP fingerprint = 0119 AD13 9EEE 264A 3F10 31D3 89B3 C6C4 60C6 4585
