Hello Daniel,
Normally, I would suspect disk failures.
Do you monitor for disk errors?
We use the script "disk_check" below, run from a regular crontab,
to check the AIX error log for disk problems and report them via
email to the system admins.
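For a quick manual check you can run the same query the script uses
(the exact errpt output layout varies a little between AIX levels):

    errpt -a -S disk | more
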
An AIX migration install is likely to fill /usr (with PTFs, etc.).
Do you have adequate free disk space in /usr/afs/?
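A quick way to check (standard AIX df; adjust the list of filesystems
to match your local layout):

    df -k /usr /usr/afs
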
I avoid migration installs. It seems to me that it is better
to start again with a fresh install for a new release of AIX.
It defragments the disks and you waste less local filesystem
space (all the accumulated junk disappears!).
A method I have used to upgrade fileservers is to configure
a "spare" machine with new levels of AIX/AFS using complete
fresh install. Then "vos move" all the AFS volumes from an
existing fileserver onto the newly installed box. When complete,
the "now empty" fileserver can be overwrite-installed to bring
it up to latest levels. This process is repeated for all fileservers
until they are all up-to-date.
This works best if your AFS fileservers are _pure_ fileservers
(i.e. they provide no other site services).
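To illustrate the evacuation step, here is a rough sketch (the server
names "oldfs" and "newfs" and the single /vicepa partition are
hypothetical, and it only handles RW volumes -- RO replicas need
"vos addsite"/"vos release" treatment as well):

    # move every RW volume from oldfs /vicepa to newfs /vicepa
    for vol in $(vos listvol oldfs a -quiet | awk '$3 == "RW" {print $1}')
    do
        vos move ${vol} oldfs a newfs a -verbose
    done

When "vos listvol oldfs" finally shows the partition empty, the old
box is safe to reinstall.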
Interestingly, this "vos-move-server-cycling" also reveals any
problematic volumes when you "empty" an old fileserver.
It is also useful to keep this "spare" machine available as a
standby to help with recovery etc if you have problems with
a "live" fileserver.
IMHO, the ability to move volumes without impacting users is one of
the features of AFS that is really brilliant.
I hope this helps.
--
cheers
paul http://acm.org/~mpb
--script "disk_check" follows--
#!/bin/ksh -
#
# @(#)disk_check 1.9 (hursley) 10/13/98
# /afs/hursley.ibm.com/common/src/afs/@cell/common/scripts/disk_check/SCCS/s.disk_check
#
# NAME disk_check
# AUTHOR Paul Blackburn http://acm.org/~mpb
# WRITTEN July 1997
# PURPOSE Check for new hard disk errors in AIX errlog
# USAGE Normally run frequently from a crontab
#
# 5,15,25,35,45,55 * * * * /afs/@cell/common/scripts/disk_check root
# 0,10,20,30,40,50 * * * * /afs/@cell/common/scripts/disk_check root
#
# Optionally, specify an alternate email notification address
# disk_check [EMAIL PROTECTED]
# HISTORY
# 1998/Oct/13 mpb added errpt check for "pdisk" errors (SSA)
# constants ---------------------------------------------------------------
# let's get defensive...
PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/sbin
IFS="
"
unset ENV
# enough defensiveness!
cmd=$(basename ${0})
cmdline="${cmd} $*"
logdir=/var/log/
log=${logdir}/${cmd}
last_run=/tmp/${cmd}_last_run
this_run=/tmp/${cmd}_this_run
diffs=/tmp/${cmd}_diffs
t=/tmp/${cmd}_t_$$
# Where to send the notification (via email) if a disk problem found
domain=$(awk '$1 == "domain" { print $2 }' /etc/resolv.conf)
default_notify=postmaster@${domain}
# functions ---------------------------------------------------------------
usage() {
    echo "Usage: ${cmd} [notify-address]"
}

fatal() {
    echo "${cmd} fatal: ${1}" >&2
    exit 1
}

error() {
    echo "${cmd} error: ${1}" >&2
}

warning() {
    echo "${cmd} warning: ${1}" >&2
}

tstamp() {
    echo "$(date '+%H:%M:%S') ${cmd}: ${1}"
}

doit() {
    tstamp "${1}"
    eval ${1}
    retcode=$?
    if [[ ${retcode} != 0 ]]; then
        error "\$?=${retcode}"
    fi
}
# main --------------------------------------------------------------------
if [[ -z "${1}" ]]; then
notify=${default_notify}
else
notify=${1}
fi
HOST=$(hostname | awk -F. '{print $1}')    # short hostname
errpt -a -S disk  > ${this_run} 2>&1     # for SCSI disks
errpt -a -S pdisk >> ${this_run} 2>&1    # for SSA disks
if [[ $? != 0 ]]; then
    fatal "\"errpt\" failed"
fi
if [[ -s ${last_run} ]]; then    # there was a previous run
    previous=${last_run}
    diff ${previous} ${this_run} > ${diffs}
    rm ${last_run}
else                             # no previous run
    previous=/dev/null
    diff ${previous} ${this_run} > ${diffs}
fi
mv ${this_run} ${last_run}
if [[ -s ${diffs} ]]; then       # there are new errors
    disk=$(grep "Resource Name:" ${diffs} | awk '{print $4; exit}')
    disk_list=$(grep "Resource Name:" ${diffs} | awk '{print $4}')
    err=$(grep "ERROR LABEL:" ${diffs} | awk '{print $4; exit}')
    sub="WARNING: ${HOST} ${disk} has ${err} in errorlog"
    cat << eeooff > ${t}
A new hard disk error ${err} has been detected on ${disk}
in the error log on ${HOST}. The details are attached (see below).
List of disks affected:
${disk_list}
--
sincerely
${cmd} script
== error log entry follows ==
diff ${previous} ${this_run}
eeooff
    cat ${t} ${diffs} | mail -s "${sub}" ${notify}
    rm ${t}
    # Also, syslog the error
    logger -i -t "${cmd}" "${sub}"
fi
rm ${diffs}
--end of script "disk_check"--
Daniel Blakeley <[EMAIL PROTECTED]> wrote:
>Hi,
>
>We're having problems with our AIX 4.3.2 AFS 3.5 servers. After a few
>days the AFS partitions fsck dirty with errors like the following.
>
>** Phase 5 - Check Inode Map
>Bad Inode Map; SALVAGE? y
>** Phase 5b - Salvage Inode Map
>map agsize bad, vm1->agsize = 0 agrsize = 2048
>map agsize bad, vm1->agsize = 56 agrsize = 2048
>map agsize bad, vm1->agsize = 0 agrsize = 2048
>...
>
>** Phase 6 - Check Block Map
>Bad Block Map; SALVAGE? y
>** Phase 6b - Salvage Block Map
>map agsize bad, vm1->agsize = -1 agrsize = 2048
>map agsize bad, vm1->agsize = -1073741825 agrsize = 2048
>map agsize bad, vm1->agsize = 1610612735 agrsize = 2048
>map agsize bad, vm1->agsize = -536870913 agrsize = 2048
>...
>
>Some history: In December we decided to upgrade the AFS fileservers from
>AIX 4.1.5 to AIX 4.3.2 (migration install) and from AFS 3.4 to AFS
>3.5. Soon after, one of the fileservers rebooted and stayed up a few
>days then rebooted again. We decided to fsck (we are using the
>Transarc fsck) the entire machine and found the AIX partitions ok but
>all the AFS partitions had errors like the ones above. Soon after the
>fileserver on another server crashed and we fscked all the AFS
>partitions on all the AFS servers and found more errors like the ones
>above on all but the least active partitions. It has been a few days
>now and the AFS partitions are beginning to show fsck errors again.
>
>Some more info: The AFS servers are all different models of RS/6000s
>and were very solid before the upgrade, so we don't think it is a
>hardware problem. We are running AFS clients at the 3.5, 3.4, and
>3.3(Linux) levels. Backups are now taking more than twice as long for
>some reason.
>
>If you can think of any reason why this is happening, we would be most
>interested.
>
>Thanks,
>- Daniel
>
>--
>Daniel Blakeley (N2YEN) Cornell Center for Materials Research
>[EMAIL PROTECTED] E20 Clark Hall