A week and a half ago we had a really big crash here because of power problems.
Among other problems, it cost us three 2 GB partitions of AFS data.
Because we ran into some general problems, I think this may be interesting for all
of you.
The trouble report number is TR-14440, and with the help of Todd de Santis (thanks) we are
back again.
We are running AFS 3.3 on several platforms:
AIX as database servers (3) and fileservers
HPUX as fileservers
SUNSOL as fileservers
OSF1 as clients
SUNOS as clients

After the crash all the systems came up again, in general. Three big partitions were
offline, so at first we thought the problem was reduced to restoring from backup
tapes. What we noticed the next day was that the VL database servers lost their
quorum every 15 minutes. A bos restart got them back. It is very inconvenient
to restart the servers every 25 minutes, and if you wait too long, OSF1 clients stop
with a panic dump. At first we suspected network problems and disabled one of the three
servers. This prolonged the uptime to 1 hour.
Next, Todd noticed that we had an AFS version 3.2 VL database,
which also seemed to be corrupted. (I regret that we know of no tool to
check the version and the database itself.)
So we decided to delete all VL databases and rebuild them.
After the deletion our quorum problems were solved.
Rebuilding the database with vos syncvldb again showed problems.
First, in many cases syncvldb said it was done, but vos examine showed that nothing had
happened.
Second, in many cases the RO volumes were found first and the RW volumes were not registered.
In all these cases we had to remove the entries with delentry, run syncvldb on
the partition with the RW volume first, and then use addsite to reconnect the
RO volume (it doesn't work if you use syncvldb on the partition with the
RO again).
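
The repair order described above (delentry, then syncvldb on the RW partition, then addsite for each RO site) could be written down as a small helper that only builds the command lines and does not run them. This is a minimal sketch, not what we actually ran: the function name, the volume name and the server and partition names are hypothetical, and the -id, -server and -partition flags are assumed from the usual vos documentation, so check them against your AFS version first.

```python
def vldb_repair_commands(volume, rw_server, rw_partition, ro_sites):
    """Return the vos command lines, in order, to re-register one volume.

    ro_sites is a list of (server, partition) pairs that hold RO copies.
    All names passed in are placeholders supplied by the caller.
    """
    commands = [
        # 1. Remove the incomplete VLDB entry (the one built from the RO copy).
        ["vos", "delentry", "-id", volume],
        # 2. Sync the partition holding the RW volume first, so the new
        #    entry is created from the RW volume.
        ["vos", "syncvldb", "-server", rw_server, "-partition", rw_partition],
    ]
    # 3. Reconnect each RO site with addsite rather than a second syncvldb.
    for server, partition in ro_sites:
        commands.append(
            ["vos", "addsite", "-server", server,
             "-partition", partition, "-id", volume]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical volume and machine names, for illustration only.
    for cmd in vldb_repair_commands("user.example", "fs1", "/vicepa",
                                    [("fs2", "/vicepb")]):
        print(" ".join(cmd))
```

Printing the commands instead of executing them lets you review the order for each volume before touching the VLDB.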
The next thing we noticed is that vos examine gives confusing information
about the volume ID of the RO copy.
The number is correct if you haven't done a release yet.
In all other cases, the first part of the vos examine output reports as the RO ID the ID of the
local clone that the system will use during the release.
If you restore the VL database with vos syncvldb and not all volumes are entered
in the database yet, this is really confusing.

The last thing we noticed is that one of our fileservers was still running AFS 3.2
software, and again there is no automated check possible.
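
Since we know of no tool that checks this automatically, one workaround is to keep a hand-maintained list of the software version on each server and compare it against the expected one. The sketch below assumes exactly that: the server names and version strings are hypothetical, and how the versions are collected is left open.

```python
def find_version_mismatches(reported, expected):
    """Return the sorted names of servers whose version differs from expected.

    reported maps a server name to the version string noted for it.
    """
    return sorted(name for name, version in reported.items()
                  if version != expected)


if __name__ == "__main__":
    # Hypothetical inventory, filled in by hand.
    inventory = {"fs1": "3.3", "fs2": "3.2", "fs3": "3.3"}
    for name in find_version_mismatches(inventory, "3.3"):
        print(name, "is not running the expected version")
```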

I hope this report will help you if you are confronted with a similar situation,
and will shorten the time needed to solve the problems. It took us 5 days to recover.

Best regards 
Manfred

-- 
     Manfred Chr. Lang TH-Darmstadt               Voice: 06151-165565
                       Hochschulrechenzentrum     Fax:   06151-163050
                       Petersenstrasse 30         EMail: 
                       D6100 Darmstadt (Germany)   [EMAIL PROTECTED]

