Another item to watch out for when decommissioning an AFS fileserver
is to make sure that any AFS Backup volume sets which list that server
by name are revised to remove it.  The same applies when a server is
renamed.
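
For example, something along these lines with the standard backup
commands should do it (the volume set name, entry index, and new server
below are made up -- check your own with "backup listvolsets" first):

   # see which volume sets mention the old server
   backup listvolsets

   # drop the stale entry and, if appropriate, point it at the new server
   backup delvolentry -name user.backups -entry 1
   backup addvolentry -name user.backups -server newserver \
       -partition /vicepa -volumes 'user\..*\.backup'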

------------------------------------------------------------------------
Bill Pitre      Battelle PNNL, Box 999  M/S K1-87, Richland WA 99352
[EMAIL PROTECTED]        Phone: 509-375-2091       FAX: 509-375-6631
------------------------------------------------------------------------
      "We all need to be salvaged every once in a while."
                   -Troy Thompson (Battelle PNNL)




On Mon, 29 Jul 1996 23:11:58 -0400 (EDT)  Jeffrey Hutzelman wrote:

--------

> Garrett D'Amore wrote:
> > If you are rebuilding the server, you should at least do a vos remsite for
> > each of the volumes on the server.  You could get this information from
> > vos listvldb.  (You might want to write a script to process its output.)
> > The vos remsite command doesn't actually remove the volumes from the
> > server, it just removes the server from the volume's vldb entry.
> > 
> > By doing this, you are preventing clients from attempting to bind to that
> > fileserver and avoiding failures during releases of those volumes.
> > 
> > Other than that (as long as it is not a database server), removing it
> > should be no big deal.  Given your list of errors, it seems a rebuild of
> > the server from scratch might be the best policy.
> 
> Hmm... As suggested above, you'll definitely want to remove the VLDB's
> idea of volumes that exist there.  For RO volumes, the appropriate action
> is to use "vos remsite" to remove just that replication site.  For RW
> volumes housed on that server, use "vos delentry" to delete the VLDB
> entry for the relevant volumes.
> 
> Other than that, we simply shut down the machine; there shouldn't be much
> else required.
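
To make both suggestions concrete, the cleanup looks roughly like this
(volume and partition names are placeholders -- pull the real ones from
listvldb, and expect one remsite per read-only site):

   # everything the VLDB still thinks lives on the dead server
   vos listvldb -server pvtserver

   # for each read-only site on that server
   vos remsite -server pvtserver -partition /vicepa -id somevolume

   # for each read-write volume housed on that server
   vos delentry -id somevolume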
> 
> Andrew Mickish wrote:
> > > What is the procedure for removing a file server from the cell?
> > > Can you turn it off, then later tell the cell to forget about it,
> > > or do you have to interact with the machine to get it out of the cell?
> > > 
> > > We have a Solaris 5.3 machine that was intended as a test platform when
> > > we first set up the cell, but it has developed problems (independent of
> > > AFS) that require rebuilding.  It has AFS volumes, but refuses to serve
> > > them. Before wiping its disk, I thought it would be nice to officially say
> > > goodbye.
> 
> > > Here is some of its nasty behavior (it is named pvtserver):
> > > 
> > >   # fs checkservers
> > >   These servers unavailable due to network or server problems: 
> > >   pvtserver. 
> > > 
> > >   # vos examine sun4m_53
> > >   Could not fetch the information about volume 536870957 from the server
> > >   Possible communication failure
> > >   Error in vos examine command. Possible communication failure
> > >   Dump only information from VLDB
> > >   sun4m_53
> > >       RWrite: 536870957
> > >       number of sites -> 1
> > >          server pvtserver partition /vicepa RW Site
> > > 
> > >   # vos listvol pvtserver
> > >   Could not fetch the list of partitions from the server
> > >   Possible communication failure
> > >   Error in vos listvol command. Possible communication failure
> > > 
> > >   # bos status pvtserver
> > >   Instance upserver, currently running normally.
> > >   Instance upclientetc, temporarily disabled, stopped for too many
> > >   errors, currently shutdown.
> > >   Instance runntp, currently running normally.
> > >   Instance fs, has core file, currently running normally.
> > >       Auxiliary status is: salvaging file system.
> > > 
> > > 
> > > Any insights about what condition this machine appears to be in would be
> > > appreciated.
> 
> All of what you just quoted can be explained by the fact that the
> salvager is currently running, which means the fileserver and volserver
> are not.  Thus, no volumes are served, "vos" commands that need to talk
> to that server won't work, and so on.  However, since you implied
> that it's an ongoing problem, the next question to ask is why?  It
> might be interesting to see the output of "bos status pvtserver fs -long",
> which will show you which server process last exited, and why.  The
> date of that core file might also be interesting; it's almost certainly
> /usr/afs/logs/core.file.fs.
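
(For reference, those two checks are just:

   bos status pvtserver fs -long
   ls -l /usr/afs/logs/core.file.fs

assuming the core really did land in the usual /usr/afs/logs spot.)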
> 
> It would also be instructive to look at the FileLog and wherever (if
> anywhere) you log the fileserver's stdout/stderr (we redirect the
> output of bosserver on startup to /usr/afs/logs/MiscLog, so as to be
> sure to catch all of this stuff).  If some piece of metadata were
> corrupted, the Fileserver might fail some assertion, which would result
> in a message printed to stderr followed by an exit on some signal.
> If you had source to the version of the fileserver you were running, you
> could see what failed in such a case...
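
If you want to capture that output the same way, a line like this in
whatever script starts the bosserver at boot should do it (paths are
the stock Transarc ones; adjust if yours differ):

   /usr/afs/bin/bosserver >> /usr/afs/logs/MiscLog 2>&1 &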
> 
> In any event, if the problem is data corruption, it might be reasonable
> to simply wipe your vice partitions and start over.  In that case, you
> could simply re-newfs those partitions, and make sure to update the VLDB
> on what you've done.
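
On a Solaris box with a single /vicepa that would look something like
the following, with the fileserver shut down while you do it (device
names here are invented -- substitute your own):

   # wipe and rebuild the vice partition
   newfs /dev/rdsk/c0t1d0s6
   mount /dev/dsk/c0t1d0s6 /vicepa

   # then bring the VLDB back in line with what the server really has
   vos syncvldb -server pvtserver
   vos syncserv -server pvtserver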
> 
> -- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
>    Systems Programmer, CMU SCS Research Facility
>    Please send requests and problem reports to [EMAIL PROTECTED]
> 

--------
