About a week and a half ago we had a major crash here, caused by power problems.
Among other things it cost us three 2 GB partitions of AFS data.
Because we ran into some general problems, I think this may be of interest to all
of you.
The trouble report number is TR-14440, and with the help of Todd de Santis (thanks) we are
back up again.
We are running AFS 3.3 on several platforms:
AIX as database servers (3) and fileservers
HPUX as fileservers
SUNSOL as fileservers
OSF1 as clients
SUNOS as clients
After the crash all the systems came back up in general. Three big partitions were
offline, so at first we thought the problem was reduced to restoring from backup
tapes. What we noticed the next day was that the VL database servers lost their
quorum every 15 minutes. A bos restart brought them back. It is very inconvenient
to restart the servers every 25 minutes, and if you wait too long the OSF1 clients stop
with a panic dump. At first we suspected network problems and disabled one of the three
servers. This prolonged the uptime to 1 hour.
Next, Todd noticed that we had an AFS version 3.2 VL database,
which also seemed to be corrupted. (I regret that we know of no tool to
check the version and the database itself.)
So we decided to delete all VL databases and rebuild them.
After the deletion our quorum problems were solved.
Rebuilding the database with vos syncvldb showed problems again.
First, in many cases syncvldb said done, but vos examine showed that nothing had
happened.
Second, in many cases the RO volumes were found first and the RW volumes were not registered.
In all these cases we had to remove the entries with delentry, run syncvldb on
the partition with the RW volume first, and then use addsite to reconnect the
RO volume (it does not work if you run syncvldb on the partition with the
RO volume again).
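
The order of the three steps above matters, so here is a minimal sketch in
Python that only builds the command lines for one affected volume and does not
run them. The function name, the volume name, and the server and partition
names are hypothetical. The -id, -server and -partition options are the usual
vos ones, but check them against the vos help output of your own AFS version.

```python
def repair_commands(volume, rw_server, rw_partition, ro_sites):
    """Return the vos command lines for re-registering one volume.

    ro_sites is a list of (server, partition) pairs that hold RO copies.
    All names passed in are site-specific; nothing is executed here.
    """
    commands = [
        # 1. Remove the VLDB entry that was built from the RO copy.
        ["vos", "delentry", "-id", volume],
        # 2. Sync the partition that holds the RW volume first.
        ["vos", "syncvldb", "-server", rw_server, "-partition", rw_partition],
    ]
    # 3. Reconnect each RO site with addsite instead of another syncvldb.
    for server, partition in ro_sites:
        commands.append(
            ["vos", "addsite", "-server", server,
             "-partition", partition, "-id", volume]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical volume and servers, for illustration only.
    for cmd in repair_commands("user.example", "fs1", "/vicepa",
                               [("fs2", "/vicepb")]):
        print(" ".join(cmd))
```

Printing the commands instead of running them lets you review the list for
each volume before anything touches the VLDB.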
The next thing we noticed is that vos examine gives confusing information
about the volume ID of the RO copy.
The number is correct if you have not done a release yet.
In all other cases, the first part of the vos examine output reports as the RO ID
the ID of the local clone that the system will use during the release.
If you restore the VL database with vos syncvldb and not all volumes are entered
in the database yet, this is really confusing.
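
Since syncvldb reported done in cases where vos examine showed no change, it
may help to check a list of expected volumes after each rebuild pass. The
following is only a sketch in Python: the function name and the volume list
are hypothetical, and it assumes that vos examine exits with a non-zero status
when it cannot find the volume, which you should confirm on your own
installation. It does not look at the RO ID that vos examine prints.

```python
import subprocess


def unregistered_volumes(volumes, run=subprocess.run):
    """Return the volumes for which `vos examine` exits non-zero.

    Assumption: a non-zero exit status means the volume could not be
    examined. `run` can be replaced for testing without a real AFS cell.
    """
    missing = []
    for volume in volumes:
        result = run(["vos", "examine", "-id", volume],
                     capture_output=True, text=True)
        if result.returncode != 0:
            missing.append(volume)
    return missing
```

Running this against the volume list from your backup records gives a short
list of volumes that still need manual attention.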
The last thing we noticed is that one of our fileservers is still running AFS 3.2
software, and again there is no automated check for this.
I hope this report will help you if you are confronted with a similar situation,
and will shorten the time needed to solve the problems. It took us 5 days to recover.
Best regards
Manfred
--
Manfred Chr. Lang TH-Darmstadt Voice: 06151-165565
Hochschulrechenzentrum Fax: 06151-163050
Petersenstrasse 30 EMail:
D6100 Darmstadt (Germany) [EMAIL PROTECTED]