Probably the best way to stress a network setup is to let the power company work their
magic (i.e. power outages longer than the UPSes can hold).
Having fallen victim to one of these this week (and madder than a wet hornet), I've found
out some things that may help others and make AFS a little more fault tolerant under
extreme conditions.
I'm running OpenAFS 1.03 and RH62 with a 2.2.18 kernel (all self compiled). The two
systems I will mention are still in the lab and not yet in production.
#1) My AFS server is also set up as a client to itself. The /cache partition is a
dedicated ext2 partition of about 230 MB. While there was no activity at 5am,
apparently some of the cache files were open at the time of the power failure. When I
came in, the AFS start script was hung and would not restart no matter what (i.e. it
would panic the system). I deleted everything out of /cache and let it recreate itself,
and all came back normally.
Paranoid Solution #1: Add this line in the /etc/rc.d/init.d/afs start script just
under the start) line: (watch out for line wrapping)
#######
find /usr/vice/cache/ -depth -print 2>/dev/null | grep -v "^/usr/vice/cache/$" |
grep -v "^/usr/vice/cache$" | grep -v "^/usr/vice/cache/lost+found$" |
xargs -l20 rm -rf 2>/dev/null
#######
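For anyone who would rather not chain greps, here is a minimal sketch of the same idea as a function. The name clear_cache_dir is made up for illustration, and it assumes a reasonably current GNU find (for -mindepth, -maxdepth and "-exec ... +"). It keeps the cache directory itself and a top-level lost+found, and removes everything else.

```shell
# Hypothetical helper (not part of OpenAFS): empty a cache directory but
# keep the directory itself and any top-level lost+found.
clear_cache_dir() {
    dir="$1"
    # Refuse to run on an empty argument or on something that is not a directory.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    # Only look at direct children of the cache directory; rm -rf handles
    # whatever is underneath them. -mindepth/-maxdepth are GNU extensions.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' -exec rm -rf {} +
}
```

It would be called as "clear_cache_dir /usr/vice/cache" just under the start) line, before the client is started. As with the one-liner above, this wipes the cache on every start, so it only suits the paranoid.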
On a side note, does anyone have an idea why corrupted cache files would not just be
deleted???
#2) My faithful client box was a bit perplexed about not finding its server and had
hung on startup. My eventual goal for all this (as any admin) is to not have the box
come fully up until it can adequately connect to the server (otherwise I have to go do
it manually). My solution is simple but still somewhat of a hack.
Paranoid Solution #2: Test whether the AFS server is running; if not, wait and retry;
if it still cannot be reached after an hour, exit and don't even try to start. This
script bit is *reasonably* server aware (trying to keep it generic so it is easier to
distribute on my side). Add this section in the /etc/rc.d/init.d/afs start script just
under the start) line: (watch out for line wrapping)
#######
# Do a primitive "up or wait test" first for a non-server client.
# This section should only be executed on a client system.
if ! test -e /usr/afs/bin/bosserver ; then
# Be careful in positioning this AFS start script in relation to other start scripts.
SECONDS=0
# Find the program that will return non-zero on failure (udebug someday?).
if test -e /usr/vice/bin/vos ; then
TESTME="/usr/vice/bin/vos listvol"
else
TESTME="vos listvol"
fi
# This is a generic kludge and may not work for everyone.
# It also assumes the primary AFS server is the lowest IP number.
THISCELL=`cat /usr/vice/etc/ThisCell | tr -d "\r" | tr -d "\n"`
until ${TESTME} `grep -i ${THISCELL} /usr/vice/etc/CellServDB | grep -v "^>" |
grep -v "^#" | sort | head -1 | cut -f1 -d ' ' | cut -f1` >/dev/null 2>/dev/null
# If the generic kludge does not work, use this line and spell out the server.
# until ${TESTME} afs01.k50.net >/dev/null 2>/dev/null
do
echo "Searching for AFS server failed, retrying..."
if test $SECONDS -gt 3600 ; then
echo "Unable to contact AFS server, exiting."
exit 1
fi
sleep 10s
done
fi #end of big client if.
#######
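The two pieces of the script above (pick a server out of CellServDB, then retry until it answers) can also be sketched as two small functions. Both names, first_cell_server and wait_for, are made up for illustration. Note two differences from my script: the sketch takes the first address listed under the cell's ">" line rather than the lowest one after sorting, and it counts only the time spent sleeping, so a slow test command makes the real wait longer than the timeout.

```shell
# Hypothetical helper: print the first server address listed for a cell in a
# CellServDB-style file (">cellname #comment" followed by "address #host" lines).
first_cell_server() {
    cell="$1"; db="$2"
    awk -v cell="$cell" '
        # A ">" line starts a new cell entry; remember whether it is ours.
        /^>/ { sub(/^>/, "", $1); in_cell = (tolower($1) == tolower(cell)); next }
        # First address line inside our cell entry wins.
        in_cell && /^[0-9]/ { print $1; exit }
    ' "$db"
}

# Hypothetical helper: retry a command until it succeeds or roughly TIMEOUT
# seconds of sleeping have passed.
# Usage: wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
wait_for() {
    timeout="$1"; interval="$2"; shift 2
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In the start script this might be used as "SERVER=`first_cell_server "$THISCELL" /usr/vice/etc/CellServDB`" followed by "wait_for 3600 10 /usr/vice/bin/vos listvol "$SERVER" || exit 1", which keeps the one hour limit and the 10 second retry from above.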
Originally I was going to use the udebug program for the ${TESTME} part, but udebug
always seems to return 0. When I pointed it at a dummy IP, it printed a -1 error
but still exited with status 0. Is that supposed to happen???
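One possible workaround for a program whose exit status cannot be trusted is to judge it by its output instead. This is only a sketch: the function name ok_by_output is made up, and the failure pattern is an assumption that has to be filled in from whatever the program really prints on your system (I have not pinned down the exact udebug error text).

```shell
# Hypothetical wrapper: run a command and report failure if it exits
# non-zero OR if its output matches a failure pattern.
# Usage: ok_by_output PATTERN COMMAND [ARGS...]
ok_by_output() {
    pattern="$1"; shift
    # Capture stdout and stderr together; a non-zero exit is still a failure.
    output=$("$@" 2>&1) || return 1
    # grep -q succeeds when the failure pattern is present, so invert it.
    # -e lets the pattern start with a dash, e.g. "-1".
    if printf '%s\n' "$output" | grep -q -e "$pattern"; then
        return 1
    fi
    return 0
}
```

The until loop could then test "ok_by_output '<failure text>' udebug ..." in place of ${TESTME}, with the pattern and the udebug arguments filled in locally.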
Now, I guess the big question is: can something like this be assimilated into the main
distribution? Maybe as an option or something? Not so much deleting the cache on
every start, but waiting until the server is there before the client starts. Typically
when my AFS client does a "half start" it will not do a full start later without a
reboot (this is why I prefer all or nothing starts).
As always, this code is provided as is without any warranty. Just because it works on
my system doesn't mean it will work perfectly on yours. If you don't know a lot
about shell scripting, BACK UP and GET HELP. This message will self destruct in 30
seconds...
B++/K90, Inc.
_______________________________________________
OpenAFS-devel mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo.cgi/openafs-devel