Solaris 7, ufsdump - (very occasional) system hang

Paul Haldane Sat, 14 Jul 2001 17:57:29 -0700
Solaris 7 (SPARC)
Amanda 2.4.2
Using ufsdump and gzip
(amadmin version output at end)


Twice this year (most recently this week) a file system on one of my
backup clients has become locked, seemingly because of Amanda's
estimate run.

I've only seen this on one of the backup clients (I'm backing up > 30
systems from this Amanda server) but there's nothing really special
about it.  It's one of our mail relay machines - Sun Ultra5, Solaris 7,
disks mirrored using DiskSuite, sendmail.

The symptoms are that, soon after the start of the nightly Amanda run,
processes trying to access (or possibly write to) a file system just
hang.  It was /var the first time, back in May; I forgot to check which
one it was today.  This of course leads to a log jam: mail doesn't go
through, processes get stuck and virtual memory fills up.

When I manage to get into the machine, I've found both times that
there are a number of Amanda processes running.  Sorry, I forgot to grab
the exact ps output, but I'm pretty sure sendsize and killpgrp were
there.  When I kill these off, the system gets _real_ busy for
a while as it catches up with things, but then settles down.  Obviously
Amanda doesn't manage to get a backup of this client during that run - it
sees it as a timeout.
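
Since the exact ps output was not captured either time, something like
the following could be left ready on the client to grab it next time.
This is only a minimal Python sketch, not part of Amanda: the helper
names, the list of process names beyond sendsize and killpgrp, and the
output path in the usage note are all assumptions made for illustration.

```python
import subprocess
import time

# Process names to look for. sendsize and killpgrp are the ones mentioned
# above; amandad and ufsdump are added here as an assumption.
AMANDA_NAMES = ("amandad", "sendsize", "killpgrp", "ufsdump")


def filter_amanda(ps_output, names=AMANDA_NAMES):
    """Keep the ps header line plus any line mentioning one of `names`."""
    lines = ps_output.splitlines()
    if not lines:
        return []
    kept = [lines[0]]
    kept.extend(line for line in lines[1:]
                if any(name in line for name in names))
    return kept


def snapshot(path):
    """Append a timestamped, filtered `ps -ef` listing to `path`."""
    out = subprocess.run(["ps", "-ef"], stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    with open(path, "a") as fh:
        fh.write(time.strftime("--- %Y-%m-%d %H:%M:%S ---\n"))
        fh.write("\n".join(filter_amanda(out)) + "\n")
```

Calling `snapshot("/tmp/amanda-ps.log")` (a hypothetical path) before
killing anything would record what was running.  The output file should
live on a file system other than the one that is hung, otherwise the
write itself may block.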

I don't see anything obviously bad in the Amanda log files on the client
until, in amandad.debug, after what looks like the end of the sendsize
report, it says...

more sendsize stuff...
...
/var/spool/mqueue 0 SIZE 70850
/local 1 SIZE 96540
/local 2 SIZE 96434
/var/spool/sendmail-logs 0 SIZE 388076
/var/spool/sendmail-logs 1 SIZE 245347
/var/spool/sendmail-logs 2 SIZE 230612
----

amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, giving up!
amandad: pid 7949 finish time Sat Jul 14 11:19:13 2001
        [11:19 is when I killed off the processes on the client]
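
For anyone wanting to check their own amandad.debug for the same
pattern, here is a minimal Python sketch (again, not part of Amanda).
It assumes the lines look exactly like the excerpt above; other Amanda
versions may format them differently.  The function name and the SAMPLE
text are made up for illustration.

```python
import re

# Patterns are based only on the excerpt above (assumption: the format
# is "<disk> <level> SIZE <number>" and the ack lines read as quoted).
SIZE_RE = re.compile(r"^(\S+) (\d+) SIZE (\d+)$")
ACK_RE = re.compile(
    r"^amandad: waiting for ack: timeout, (retrying|giving up!)$")


def summarise_amandad_debug(text):
    """Return (estimates, retries, gave_up) for amandad.debug text.

    estimates maps (disk, level) to the number on the SIZE line.
    """
    estimates = {}
    retries = 0
    gave_up = False
    for raw in text.splitlines():
        line = raw.strip()
        m = SIZE_RE.match(line)
        if m:
            estimates[(m.group(1), int(m.group(2)))] = int(m.group(3))
            continue
        m = ACK_RE.match(line)
        if m:
            if m.group(1) == "retrying":
                retries += 1
            else:
                gave_up = True
    return estimates, retries, gave_up


# Shortened sample in the same shape as the excerpt above.
SAMPLE = """\
/var/spool/mqueue 0 SIZE 70850
/local 1 SIZE 96540
----

amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, retrying
amandad: waiting for ack: timeout, giving up!
"""

if __name__ == "__main__":
    print(summarise_amandad_debug(SAMPLE))
```

A run where estimates are present but gave_up is true matches what is
described here: sendsize finished its report, but the reply was never
acknowledged by the server.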

Does anyone recognise these symptoms?  Any ideas on whether it's an Amanda
problem (which might go away if I update my installation to 2.4.2p2, which
I should probably do anyway) or something to do with ufsdump?
I've been running Amanda backups on this client for almost four months
now and this has only happened twice, but it's a pain when it does
happen.  It's making me very wary of adding our other two mail relays
to the Amanda schedule - I don't want to take all three out in one go :->.

Paul
-- 
Paul Haldane
Computing Service
University of Newcastle


build: VERSION="Amanda-2.4.2"
       BUILT_DATE="Tuesday August 29 12:24:57 BST 2000"
       BUILT_MACH="SunOS carr6 5.7 Generic_106541-07 sun4u sparc SUNW,Ultra-5_10"
       CC="/usr/local/gnu/bin/gcc"
paths: bindir="/usr/local/amanda/bin"
       sbindir="/usr/local/amanda/sbin"
       libexecdir="/usr/local/amanda/libexec"
       mandir="/usr/local/amanda/man" AMANDA_TMPDIR="/tmp/amanda"
       AMANDA_DBGDIR="/tmp/amanda"
       CONFIG_DIR="/usr/local/amanda/etc/amanda"
       DEV_PREFIX="/dev/dsk/" RDEV_PREFIX="/dev/rdsk/"
       DUMP="/usr/local/amanda/libexec/ufsdump"
              [that's a standard Solaris ufsdump binary, binary-edited
               using emacs to use an alternate dumpdates file so that
               it doesn't conflict with another backup scheme we run in
               parallel]
       RESTORE="/usr/sbin/ufsrestore"
       GNUTAR="/var/local/etc/gtar"
       COMPRESS_PATH="/usr/local/gnu/bin/gzip"
       UNCOMPRESS_PATH="/usr/local/gnu/bin/gzip"
       MAILER="/usr/bin/mailx"
       listed_incr_dir="/usr/local/amanda/var/amanda/gnutar-lists"
defs:  DEFAULT_SERVER="ucsbs1.ncl.ac.uk"
       DEFAULT_CONFIG="DailySet1"
       DEFAULT_TAPE_SERVER="ucsbs1.ncl.ac.uk"
       DEFAULT_TAPE_DEVICE="/dev/null" HAVE_MMAP HAVE_SYSVSHM
       LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE BSD_SECURITY
       USE_AMANDAHOSTS CLIENT_LOGIN="root" FORCE_USERID HAVE_GZIP
       COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
       COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"

And all the disks are using the following dumptype:

define dumptype comp-root-tar {
    root-tar
    comment "Root partitions with compression"
    compress client fast
}