Probably the best way to stress a network setup is to let the power company work its
magic (i.e. power outages longer than the UPSes can hold).

Having fallen victim to one of these this week (and madder than a wet hornet), I've
found out some things that may help others and make AFS a little more fault tolerant
under extreme conditions.

I'm running OpenAFS 1.03 on Red Hat 6.2 (RH62) with a 2.2.18 kernel (all self-compiled).
The two systems I will mention are still in the lab and not yet in production.

#1) My AFS server is also set up as a client to itself.  The /cache partition is a
dedicated ext2 partition of about 230 MB.  While it didn't have any activity at 5am,
apparently some of the cache files were open at the time of the power failure.  When I
came in, the AFS start script was hung and would not restart no matter what (i.e. it
would panic the system).  I deleted everything out of /cache and let it recreate
itself, and all came back normally.

Paranoid Solution #1:  Add this line in the /etc/rc.d/init.d/afs start script just 
under the start) line: (watch out for line wrapping)

#######
find /usr/vice/cache/ -depth -print 2>/dev/null | grep -v "^/usr/vice/cache/$" |
  grep -v "^/usr/vice/cache$" | grep -v "^/usr/vice/cache/lost+found$" |
  xargs -l20 rm -rf 2>/dev/null
#######
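
For anyone who would rather avoid the grep chain, here is a minimal sketch of the same cleanup as a shell function.  It is only an illustration: the name clean_cache_dir is made up for this example, and it assumes a find that supports -mindepth and -maxdepth (GNU find does).  It removes everything directly under the given directory except lost+found and leaves the directory itself alone.

```shell
# clean_cache_dir DIR: remove everything directly under DIR except lost+found.
# Hypothetical helper name; assumes a find with -mindepth/-maxdepth (e.g. GNU find).
clean_cache_dir() {
    dir=$1
    # Refuse to run on an empty argument or on something that is not a directory.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    # Only look at the top level; rm -rf takes care of any subdirectories.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name 'lost+found' \
        -exec rm -rf {} + 2>/dev/null
}
```

In the start script this would be called as "clean_cache_dir /usr/vice/cache" just under the start) line.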

On a side note, does anyone have an idea why corrupted cache files would not just be 
deleted???

#2) My faithful client box was a bit perplexed about not finding its server and had 
hung on startup.  My eventual goal for all this (as any admin) is to not have the box 
come fully up until it can adequately connect to the server (otherwise I have to go do 
it manually).  My solution is simple but still somewhat of a hack.

Paranoid Solution #2:  Test whether the AFS server is answering; if not, wait and
retry; if it still fails after an hour, exit without even trying to start the client.
This script bit is *reasonably* server aware (I'm trying to keep it generic so it is
easier to distribute on my side).  Add this section in the /etc/rc.d/init.d/afs start
script just under the start) line: (watch out for line wrapping)

#######

        # Do a primitive "up or wait test" first for a non-server client.
        # This section should only be executed on a client system.
        if ! test -e /usr/afs/bin/bosserver ; then
        # Be careful in positioning this AFS start script in relation to other
        # start scripts.
        # bash increments SECONDS on its own; reset it so it times the wait below.
        SECONDS=0
        # Find the program that will return non-zero if it fails (udebug someday?).
        if test -e /usr/vice/bin/vos ; then
           TESTME="/usr/vice/bin/vos listvol"
        else
           TESTME="vos listvol"
           fi
        # This is a generic kludge and may not work for everyone.
        # It also assumes the primary AFS server is the lowest IP number.
        THISCELL=`cat /usr/vice/etc/ThisCell | tr -d "\r" | tr -d "\n"`
        until ${TESTME} `grep -i ${THISCELL} /usr/vice/etc/CellServDB | grep -v "^>" |
                grep -v "^#" | sort | head -1 | cut -f1 -d ' ' | cut -f1` >/dev/null 2>/dev/null
        # If the generic kludge does not work, use this line and spell out the server.
        # until ${TESTME} afs01.k50.net >/dev/null 2>/dev/null
           do
           echo "Searching for AFS server failed, retrying..."
           if test $SECONDS -gt 3600 ; then
              echo "Unable to contact AFS server, exiting."
              exit 1
              fi
           sleep 10s
           done
        fi #end of big client if.

#######
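
The server lookup above relies on the cell name showing up on the server lines of CellServDB.  As a rough alternative, here is a sketch that reads the block under the ">cellname" header instead.  The function name first_server is invented for this example, and note the difference in behaviour: it prints the first server listed for the cell, not the lowest IP number as the kludge above assumes.

```shell
# first_server CELL [FILE]: print the first server address listed for CELL.
# Hypothetical helper; FILE defaults to the client's CellServDB.
first_server() {
    cell=$1
    db=${2:-/usr/vice/etc/CellServDB}
    awk -v cell="$cell" '
        # A ">name" line starts a new cell block; note whether it is our cell.
        /^>/ { name = substr($1, 2); incell = (tolower(name) == tolower(cell)); next }
        # First non-comment, non-empty line inside our block: print the address.
        incell && !/^#/ && NF > 0 { print $1; exit }
    ' "$db"
}
```

With that in place the test line could read: until ${TESTME} `first_server ${THISCELL}` >/dev/null 2>/dev/null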

Originally I was going to use the udebug program to do the ${TESTME} part, but udebug
always seems to return 0.  When I pointed it at a dummy IP, it printed out a -1 error,
but still returned 0.  Is that supposed to happen?

Now, I guess the big question is: can something like this be assimilated into the main
distribution, maybe as an option or something?  Not so much deleting the cache on
every start, but waiting until the server is there before the client starts.  Typically
when my AFS client does a "half start" it will not do a full start later without a
reboot (this is why I prefer all-or-nothing starts).
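
To make the "wait until the server is there" idea easier to reuse, here is a minimal sketch of the retry loop pulled out into a function.  The name wait_for and its argument order are my own invention for illustration, not anything from the OpenAFS distribution.  It counts elapsed time from the sleep interval rather than from bash's SECONDS, so it should also work in a plain sh.

```shell
# wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: retry COMMAND every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once about TIMEOUT seconds of sleeping have passed.
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In the start script it might be used like this, with the server name spelled out as in the commented line of the script above:

    wait_for 3600 10 /usr/vice/bin/vos listvol afs01.k50.net || exit 1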

As always, this code is provided as is without any warranty.  Just because it works on 
my system doesn't mean it will work perfectly on yours.  If you don't know a lot
about shell scripting, BACK UP and GET HELP.  This message will self destruct in 30 
seconds...

B++/K90, Inc.
_______________________________________________
OpenAFS-devel mailing list
[EMAIL PROTECTED]
https://lists.openafs.org/mailman/listinfo.cgi/openafs-devel
