Hi there...

I need some help in solving a recent problem with our AFS backup system.

First, the system description.
The cell spans 4 physically distant sites, and we have 5 AFS database servers
(3 on the main site, one on site 2, one on site 3 and none on site 4).
The buserver runs on all 5 of them, and the sync-site is usually one of the
3 main DB servers, as they all have lower IPs. The butc process runs on
one of the main servers, which has a TZ867 tape unit; the backup process
is usually run on the same server.
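
To make the "lower IPs" remark concrete, here is a minimal Python sketch of
the ordering I mean. It only illustrates a numeric comparison of addresses,
not the election the servers actually perform; the function name and the
addresses are made up for illustration (they are not our real ones).

```python
import ipaddress


def lowest_ip(addresses):
    """Return the numerically lowest address from a list of dotted-quad strings.

    Hypothetical helper: compares addresses as numbers, not as text,
    so "192.0.2.3" sorts below "192.0.2.20".
    """
    return min(addresses, key=ipaddress.ip_address)
```

For example, lowest_ip(["192.0.2.20", "192.0.2.3", "192.0.2.100"]) picks
"192.0.2.3", whereas a plain string comparison would pick "192.0.2.100".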



Now the problems. We've been doing backups of AFS since the beginning of
the year with no problem whatsoever, with just 1 DB server. We then
added the extra 4 sites, and the backup still worked OK.

Trouble began just last week. As far as we can tell, no changes were made to
any kind of configuration (but you can never be too sure). The butc command
apparently executes OK, but upon execution of any "backup dump" command
both of them lock up just after reporting the number of volumes to dump
and before listing them. The sync-site's buserver then proceeds to hog
all of the CPU on that machine, and it stays like that until manually
stopped.

At that point the only unusual symptom (apart from the lock-up and the
CPU hogging) is in the "udebug <server> 7021" output:

        ...
        I am currently managing write trans 831050094.4
        ...
        0 locked pages, 0 of them for write
        There are write locks held
        ...

I have absolutely no clue as to what it means.
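
In case it helps anyone compare the state of all five buservers, the check
could be scripted. The following is only a minimal Python sketch: it assumes
udebug is on the PATH and Python 3.7 or later, and the function names
(parse_udebug, check_server, check_all) are made up for illustration. It
looks only for the two lines quoted above and does not try to interpret them.

```python
import subprocess


def parse_udebug(text):
    """Extract the write-transaction id and the write-lock flag from udebug output."""
    status = {"write_trans": None, "write_locks_held": False}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("I am currently managing write trans"):
            # The transaction id is the last word on the line.
            status["write_trans"] = line.split()[-1]
        elif line == "There are write locks held":
            status["write_locks_held"] = True
    return status


def check_server(host, port=7021):
    """Run "udebug <host> <port>" and parse whatever it prints."""
    result = subprocess.run(
        ["udebug", host, str(port)],
        capture_output=True,
        text=True,
        check=False,  # keep going even if udebug exits non-zero
    )
    return parse_udebug(result.stdout)


def check_all(hosts, port=7021):
    """Return {host: parsed status} for every host in the list."""
    return {host: check_server(host, port) for host in hosts}
```

Calling check_all() with the list of DB server names would show at a glance
which buserver reports a write transaction and held write locks.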

I've found that restarting the hogged buserver sometimes works: the
dump proceeds up to the point where the database has to be updated and
then it locks up again until the next restart. (By the way, I fear
the integrity of my backup DB may be affected by this, aaargh.)

I've also unsuccessfully tried using a reduced set of 3 backup servers;
shutting down and restarting all of the main servers on Labor Day; and
restarting all of the DB servers, either manually or by waiting for the auto
restart on Sunday. The backup still locks up every single time it is
tried.

Apart from one last idea (restoring the DB from tape is an option we've
not tried...), I've run out of imagination. If you have any hint that may
help me solve this problem, please drop me a note...

Cheers,

        -Felipe
