Hello Daniel,
Normally, I would suspect disk failures.
Do you monitor for disk errors?
We use the script "disk_check" below, run from a regular crontab,
to check the AIX error log for disk problems and report them via
email to the system admins.
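For a quick manual check you can run the same query the script uses
(the exact errpt output layout varies a little between AIX levels):

    errpt -a -S disk | more
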
An AIX migration install is likely to fill /usr (with PTFs, etc.).
Do you have adequate free disk space in /usr/afs/?
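A quick way to check (standard AIX df; adjust the list of filesystems
to match your local layout):

    df -k /usr /usr/afs
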
I avoid migration installs. It seems to me that it is better
to start again with a fresh install for a new release of AIX.
It defragments the disks and you waste less local filesystem
space (all the accumulated junk disappears!).
A method I have used to upgrade fileservers is to configure
a "spare" machine with new levels of AIX/AFS using complete
fresh install. Then "vos move" all the AFS volumes from an
existing fileserver onto the newly installed box. When complete,
the "now empty" fileserver can be overwrite-installed to bring
it up to latest levels. This process is repeated for all fileservers
until they are all up-to-date.
This works best if your AFS fileservers are _pure_ fileservers
(i.e. they provide no other site services).
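To illustrate the evacuation step, here is a rough sketch (the server
names "oldfs" and "newfs" and the single /vicepa partition are
hypothetical, and it only handles RW volumes -- RO replicas need
"vos addsite"/"vos release" treatment as well):

    # move every RW volume from oldfs /vicepa to newfs /vicepa
    for vol in $(vos listvol oldfs a -quiet | awk '$3 == "RW" {print $1}')
    do
        vos move ${vol} oldfs a newfs a -verbose
    done

When "vos listvol oldfs" finally shows the partition empty, the old
box is safe to reinstall.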
Interestingly, this "vos-move-server-cycling" also reveals any
problematic volumes when you "empty" an old fileserver.
It is also useful to keep this "spare" machine available as a
standby to help with recovery etc if you have problems with
a "live" fileserver.
IMHO, the ability to move volumes without impacting users is one of
the features of AFS that is really brilliant.
I hope this helps.
--
cheers
paul http://acm.org/~mpb
--script "disk_check" follows--
#!/bin/ksh -
#
# @(#)disk_check 1.9 (hursley) 10/13/98
# /afs/hursley.ibm.com/common/src/afs/@cell/common/scripts/disk_check/SCCS/s.disk_check
#
# NAME disk_check
# AUTHOR Paul Blackburn http://acm.org/~mpb
# WRITTEN July 1997
# PURPOSE Check for new hard disk errors in AIX errlog
# USAGE Normally run frequently from a crontab
#
# 5,15,25,35,45,55 * * * * /afs/@cell/common/scripts/disk_check root
# 0,10,20,30,40,50 * * * * /afs/@cell/common/scripts/disk_check root
#
# Optionally, specify an alternate email notification address
# disk_check [EMAIL PROTECTED]
# HISTORY
# 1998/Oct/13 mpb added errpt check for "pdisk" errors (SSA)
# constants ---------------------------------------------------------------
# let's get defensive...
PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/sbin
IFS="
"
unset ENV
# enough defensiveness!
cmd=$(basename ${0})
cmdline="${cmd} $*"
logdir=/var/log/
log=${logdir}/${cmd}
last_run=/tmp/${cmd}_last_run
this_run=/tmp/${cmd}_this_run
diffs=/tmp/${cmd}_diffs
t=/tmp/${cmd}_t_$$
# Where to send the notification (via email) if a disk problem found
domain=$(awk '$1 == "domain" { print $2 }' /etc/resolv.conf)
default_notify=postmaster@${domain}
# functions ---------------------------------------------------------------
usage() {
    echo "Usage: ${cmd} [notify-address]"
}

fatal() {
    echo "${cmd} fatal: ${1}" >&2
    exit 1
}

error() {
    echo "${cmd} error: ${1}" >&2
}

warning() {
    echo "${cmd} warning: ${1}" >&2
}

tstamp() {
    echo "$(date '+%H:%M:%S') ${cmd}: ${1}"
}

doit() {
    tstamp "${1}"
    eval ${1}
    retcode=$?
    if [[ ${retcode} != 0 ]]; then
        error "\$?=${retcode}"
    fi
}
# main --------------------------------------------------------------------
if [[ -z "${1}" ]]; then
notify=${default_notify}
else
notify=${1}
fi
HOST=$(hostname | awk -F. '{print $1}')    # short hostname
errpt -a -S disk  > ${this_run} 2>&1     # for SCSI disks
errpt -a -S pdisk >> ${this_run} 2>&1    # for SSA disks
if [[ $? != 0 ]]; then
    fatal "\"errpt\" failed"
fi
if [[ -s ${last_run} ]]; then    # there was a previous run
    previous=${last_run}
    diff ${previous} ${this_run} > ${diffs}
    rm ${last_run}
else                             # no previous run
    previous=/dev/null
    diff ${previous} ${this_run} > ${diffs}
fi
mv ${this_run} ${last_run}
if [[ -s ${diffs} ]]; then       # there are new errors
    disk=$(grep "Resource Name:" ${diffs} | awk '{print $4; exit}')
    disk_list=$(grep "Resource Name:" ${diffs} | awk '{print $4}')
    err=$(grep "ERROR LABEL:" ${diffs} | awk '{print $4; exit}')
    sub="WARNING: ${HOST} ${disk} has ${err} in errorlog"
    cat << eeooff > ${t}
A new hard disk error ${err} has been detected on ${disk}
in the error log on ${HOST}. The details are attached (see below).
List of disks affected:
${disk_list}
--
sincerely
${cmd} script
== error log entry follows ==
diff ${previous} ${this_run}
eeooff
    cat ${t} ${diffs} | mail -s "${sub}" ${notify}
    rm ${t}
    # Also, syslog the error
    logger -i -t "${cmd}" "${sub}"
fi
rm ${diffs}
--end of script "disk_check"--
Daniel Blakeley <[EMAIL PROTECTED]> wrote:
>Hi,
>
>We're having problems with our AIX 4.3.2 AFS 3.5 servers. After a few
>days the AFS partitions fsck dirty with errors like the following.
>
>** Phase 5 - Check Inode Map
>Bad Inode Map; SALVAGE? y
>** Phase 5b - Salvage Inode Map
>map agsize bad, vm1->agsize = 0 agrsize = 2048
>map agsize bad, vm1->agsize = 56 agrsize = 2048
>map agsize bad, vm1->agsize = 0 agrsize = 2048
>...
>
>** Phase 6 - Check Block Map
>Bad Block Map; SALVAGE? y
>** Phase 6b - Salvage Block Map
>map agsize bad, vm1->agsize = -1 agrsize = 2048
>map agsize bad, vm1->agsize = -1073741825 agrsize = 2048
>map agsize bad, vm1->agsize = 1610612735 agrsize = 2048
>map agsize bad, vm1->agsize = -536870913 agrsize = 2048
>...
>
>Some history: In December we decided to upgrade the AFS fileservers from
>AIX 4.1.5 to AIX 4.3.2 (migration install) and from AFS 3.4 to AFS
>3.5. Soon after, one of the fileservers rebooted and stayed up a few
>days then rebooted again. We decided to fsck (we are using the
>Transarc fsck) the entire machine and found the AIX partitions ok but
>all the AFS partitions had errors like the ones above. Soon after the
>fileserver on another server crashed and we fscked all the AFS
>partitions on all the AFS servers and found more errors like the ones
>above on all but the least active partitions. It has been a few days
>now and the AFS partitions are beginning to show fsck errors again.
>
>Some more info: The AFS servers are all different models of RS/6000s
>and were very solid before the upgrade, so we don't think it is a
>hardware problem. We are running AFS clients at the 3.5, 3.4, and
>3.3(Linux) levels. Backups are now taking more than twice as long for
>some reason.
>
>If you can think of any reason why this is happening, we would be most
>interested.
>
>Thanks,
>- Daniel
>
>--
>Daniel Blakeley (N2YEN) Cornell Center for Materials Research
>[EMAIL PROTECTED] E20 Clark Hall