On Tue, 14 Jul 1998, Mathias Feiler wrote:

> On Fri, 10 Jul 1998, Nathan Rawling wrote:
> 
> > 
> ---- snip ---- 
> > In my experience, the best way is this:
> > 
> >     1) Bring up new servers
> >     2) Add new servers to clients' CellServDB
> >     3) Wait a few days (I have bad luck)
> >     4) Kill the old servers
> >     5) Update clients' CellServDB files 
> > 
> > This results in minimal cell outage. There is a small downtime
> ---- snap -----
> 
> Hi to all of You 
> 
> I'm no old afs-guru, I'm involved for a year now.
> 
> Are you sure you did not mix some things up?
> I am thinking of this:
> 3 new servers and 3 old ones!
> Then: 4) Kill the old servers
> results in the sync site being gone.

Perhaps I'm missing something, since I don't see any major difference
except for how many servers you bring down at once.

I guess my terse list was unclear. When you kill the old servers, the
remaining servers will renegotiate the sync-site if they have a quorum
based on their /usr/afs/etc/CellServDB files. If you were to kill the
three servers without taking them out of the servers' CellServDBs you
would likely eliminate your quorum, and deadlock the election.
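
As a rough illustration of the quorum arithmetic, here is a minimal
Python sketch. It assumes a plain majority of the servers listed in
/usr/afs/etc/CellServDB and ignores any tie-breaking weight the real
election may give to a particular server. The host names and the
has_quorum helper are hypothetical.

```python
def has_quorum(listed_servers, running_servers):
    """True if strictly more than half of the listed servers are running."""
    listed = set(listed_servers)
    up = listed & set(running_servers)
    return 2 * len(up) > len(listed)

old = ["old1", "old2", "old3"]  # hypothetical host names
new = ["new1", "new2", "new3"]

# Old servers killed but still listed in the servers' CellServDB:
# only 3 of the 6 listed servers are up, so there is no majority.
still_listed = has_quorum(old + new, new)

# Old servers killed and removed from the servers' CellServDB:
# 3 of the 3 listed servers are up.
removed = has_quorum(new, new)
```

Under that simplification, still_listed is False and removed is True,
which is the difference between a deadlocked election and a clean one.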

IMHO, killing the three servers at once is cleaner, and you are going to
have to sit through an election anyway.

There are many ways of doing this that should work, based on the inherent
resiliency of AFS. There is a Transarc recommended way, and there are
benefits to doing things their way. Over the past few years, I've learned
what corners I could shave off.

For a different angle, let's talk about the things you want to avoid.

1) Server with bad database copies, or no database copies elected as sync 
   site. I've heard unofficial remarks from Transarc that a DB server
   with no databases will not allow itself to become the sync site. With
   that said, I'm not sure I'd risk it. Before you touch anything, make
   sure the recovery state (from udebug) is 1f.

2) Deadlocked cell election. Make sure that at any given time, the servers
   you want running and participating in the cell are the only ones in the
   /usr/afs/etc/CellServDB file. Typos can really hurt you here by 
   prolonging your election indefinitely.

3) AFS clients don't have any valid DB servers. This happened to me,
   it wasn't pretty, don't let it happen to you. It doesn't really hurt
   the clients any to have CellServDB files that include servers that
   aren't there.

4) AFS clients' CellServDB files don't contain the sync-site. The exact
   results of this escape me at the moment, but they're not pleasant.

5) Brand-new hardware failure. We had a very hard time with an Ultra 200E
   which had a factory-defective motherboard. We thought we had the problem
   fixed *twice* and nearly brought it into the cell before it flaked out 
   again. After nearly a year of swapping out replacement parts, we 
   replaced the motherboard a second time (it was one of the first parts
   replaced) and the problem magically went away.
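
The checks in items 1, 3 and 4 can be sketched as a small pre-flight
script. This is only an illustration in Python: it assumes the udebug
output contains a line of the form "Recovery state 1f", the sample text
is made up, and the helper names and host names are hypothetical.

```python
import re

# Made-up sample text; only the "Recovery state" line matters here.
SAMPLE_UDEBUG = """\
Local db version is 900000000.1
Recovery state 1f
"""

def recovery_state(udebug_output):
    """Return the hex recovery state as a lowercase string, or None."""
    match = re.search(r"Recovery state\s+([0-9a-fA-F]+)", udebug_output)
    return match.group(1).lower() if match else None

def client_problems(client_servers, live_db_servers, sync_site):
    """List the problems from items 3 and 4 for one client CellServDB."""
    problems = []
    if not set(client_servers) & set(live_db_servers):
        problems.append("no valid DB server listed")
    if sync_site not in client_servers:
        problems.append("sync site not listed")
    return problems

# Item 1: do not touch anything unless the recovery state is 1f.
state = recovery_state(SAMPLE_UDEBUG)
safe_to_proceed = (state == "1f")

# A stale entry ("old1") is harmless as long as live servers are listed too.
ok_client = client_problems(["old1", "new1", "new2"], ["new1", "new2"], "new1")
bad_client = client_problems(["old1"], ["new1", "new2"], "new1")
```

Here ok_client comes back empty, while bad_client reports both
problems, since its only entry is a server that is no longer there.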

As an aside, if anyone has a Sun Microsystems Ultra 1 200E that they're
having really weird problems with, there is a part number on the
motherboard, and you might want to mention the part number while on the
phone with Sun Warranty support.

Nathan
