Solaris 7 (SPARC)
Amanda 2.4.2
Using ufsdump and gzip
(amadmin version output at end)
Twice this year (most recently just this week) a file system on one
of my backup clients has become locked, seemingly due to Amanda's
estimate run.
I've only seen this on one of the backup clients (I'm backing up > 30
systems from this Amanda server) but there's nothing really special
about it. It's one of our mail relay machines - Sun Ultra5, Solaris 7,
disks mirrored using DiskSuite, sendmail.
The symptoms I'm seeing are that soon after the start of the nightly
Amanda run, processes trying to access (or possibly write to) a
file system (it was /var the first time back in May; I forgot to check
which one it was today) just hang, and this of course leads to a log jam.
Mail doesn't go through, processes get stuck and virtual memory fills up.
Both times, when I managed to get into the machine, I found
a number of amanda processes running - sorry, I forgot to grab
the exact ps output but I'm pretty sure sendsize and killpgrp were
there. When I kill these off the system gets _real_ busy for
a while as it catches up with things but then settles down. Obviously
Amanda doesn't manage to get a backup of this client during that run - it
sees it as a timeout.
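
So that the exact process list isn't lost next time, something along
these lines could be kept on the client and run before killing anything.
It is only a sketch: the process names are the ones mentioned in this
message (sendsize, killpgrp, amandad, ufsdump), and the helper names
filter_amanda_processes and snapshot are made up for illustration.

```python
import os
import subprocess

# Process names taken from this message; adjust for your own setup.
AMANDA_NAMES = {"sendsize", "killpgrp", "amandad", "ufsdump"}

def filter_amanda_processes(ps_output):
    """Return the lines of `ps -ef` output that mention an Amanda process.

    A line matches if any whitespace-separated field, reduced to its
    basename, is one of AMANDA_NAMES.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split()
        if any(os.path.basename(f) in AMANDA_NAMES for f in fields):
            hits.append(line)
    return hits

def snapshot():
    """Run `ps -ef` and return only the Amanda-related lines."""
    out = subprocess.run(["ps", "-ef"], capture_output=True,
                         text=True, check=True).stdout
    return filter_amanda_processes(out)
```

Calling snapshot() and saving the result to a file before killing the
stuck processes would preserve the PIDs and parent PIDs for later.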
I don't see anything obviously bad in the Amanda log files on the client,
except that in amandad.debug, after what looks like the end of the
sendsize report, it says...
more sendsize stuff...
...
/var/spool/mqueue 0 SIZE 70850
/local 1 SIZE 96540
/local 2 SIZE 96434
/var/spool/sendmail-logs 0 SIZE 388076
/var/spool/sendmail-logs 1 SIZE 245347
/var/spool/sendmail-logs 2 SIZE 230612
----
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, giving up!
amandad: pid 7949 finish time Sat Jul 14 11:19:13 2001
[11:19 is when I killed off the processes on the client]
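
For anyone wanting to check their own client's amandad.debug for the
same pattern, here is a minimal sketch of a scanner. It assumes the
line shapes are exactly those in the excerpt above ("<disk> <level>
SIZE <number>" and "amandad: waiting for ack: timeout, ..."); other
Amanda versions may log differently, and summarize_debug is a
hypothetical helper name, not part of Amanda.

```python
import re

# Patterns modelled on the excerpt in this message only.
SIZE_RE = re.compile(r"^(\S+) (\d+) SIZE (-?\d+)$")
ACK_RE = re.compile(
    r"^amandad: waiting for ack: timeout, (retrying|giving up!)$")

def summarize_debug(text):
    """Collect estimate lines and count ack timeouts in a debug log."""
    estimates = []   # (disk, dump level, reported size)
    retries = 0
    gave_up = False
    for raw in text.splitlines():
        line = raw.strip()
        m = SIZE_RE.match(line)
        if m:
            estimates.append((m.group(1), int(m.group(2)), int(m.group(3))))
            continue
        m = ACK_RE.match(line)
        if m:
            if m.group(1) == "retrying":
                retries += 1
            else:
                gave_up = True
    return {"estimates": estimates, "retries": retries, "gave_up": gave_up}
```

Estimates present together with gave_up set would suggest that sendsize
finished its report but the acknowledgement never arrived, which is
what the excerpt above appears to show.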
Does anyone recognise these symptoms? Any ideas on whether it's an Amanda
problem (which might go away if I update my installation to 2.4.2p2 which
I should probably do anyway) or something to do with ufsdump?
I've been running Amanda backups on this client for almost four months
now and this has only happened twice but it's a pain when it does
happen. It's making me very wary of adding our other two mail relays
to the Amanda schedule - I don't want to take all three out in one go :->.
Paul
--
Paul Haldane
Computing Service
University of Newcastle
build: VERSION="Amanda-2.4.2"
BUILT_DATE="Tuesday August 29 12:24:57 BST 2000"
BUILT_MACH="SunOS carr6 5.7 Generic_106541-07 sun4u sparc SUNW,Ultra-5_10"
CC="/usr/local/gnu/bin/gcc"
paths: bindir="/usr/local/amanda/bin"
sbindir="/usr/local/amanda/sbin"
libexecdir="/usr/local/amanda/libexec"
mandir="/usr/local/amanda/man" AMANDA_TMPDIR="/tmp/amanda"
AMANDA_DBGDIR="/tmp/amanda"
CONFIG_DIR="/usr/local/amanda/etc/amanda"
DEV_PREFIX="/dev/dsk/" RDEV_PREFIX="/dev/rdsk/"
DUMP="/usr/local/amanda/libexec/ufsdump"
[that's a standard Solaris ufsdump binary edited
using emacs to use an alternate dumpdates file so that
it doesn't conflict with another backup scheme we run in
parallel]
RESTORE="/usr/sbin/ufsrestore"
GNUTAR="/var/local/etc/gtar"
COMPRESS_PATH="/usr/local/gnu/bin/gzip"
UNCOMPRESS_PATH="/usr/local/gnu/bin/gzip"
MAILER="/usr/bin/mailx"
listed_incr_dir="/usr/local/amanda/var/amanda/gnutar-lists"
defs: DEFAULT_SERVER="ucsbs1.ncl.ac.uk"
DEFAULT_CONFIG="DailySet1"
DEFAULT_TAPE_SERVER="ucsbs1.ncl.ac.uk"
DEFAULT_TAPE_DEVICE="/dev/null" HAVE_MMAP HAVE_SYSVSHM
LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE BSD_SECURITY
USE_AMANDAHOSTS CLIENT_LOGIN="root" FORCE_USERID HAVE_GZIP
COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
All the disks use the following dumptype:
define dumptype comp-root-tar {
    root-tar
    comment "Root partitions with compression"
    compress client fast
}