Garrett D'Amore wrote:
> If you are rebuilding the server, you should at least do a vos remsite for
> each of the volumes on the server. You could get this information from
> vos listvldb. (You might want to write a script to process its output.)
> The vos remsite command doesn't actually remove the volumes from the
> server; it just removes the server from the volume's VLDB entry.
>
> Doing this prevents clients from attempting to contact that fileserver
> and avoids failures when those volumes are released.
>
> Other than that (as long as it is not a database server), removing it
> should be no big deal. Given your list of errors, it seems a rebuild of
> the server from scratch might be the best policy.
Hmm... As suggested above, you'll definitely want to remove the VLDB's
record of any volumes it thinks exist there. For RO volumes, the appropriate action
is to use "vos remsite" to remove just that replication site. For RW
volumes housed on that server, use "vos delentry" to delete the VLDB
entry for the relevant volumes.
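For example, using sun4m_53 from the listing quoted below (the RO volume
name here is only a placeholder; substitute whatever replicas the VLDB
actually lists on that server):

   # vos remsite pvtserver a some.volume.readonly
   # vos delentry sun4m_53

The first command drops only the replication site on pvtserver; the
second deletes the whole VLDB entry, which is what you want for an RW
volume whose only home was that server.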
Other than that, we simply shut down the machine; there shouldn't be much
else required.
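For the bulk case Garrett mentioned (scripting over the vos listvldb
output), a rough, untested sketch might look something like this. It
assumes everything the VLDB lists on pvtserver is an RO site on /vicepa,
so eyeball the output first and use vos delentry for any RW volumes
instead:

   #!/bin/sh
   # Drop pvtserver as a site for every volume the VLDB still thinks
   # lives there.  The awk heuristic just picks out the volume-name
   # lines, which are the only single-field lines in the output.
   for vol in `vos listvldb -server pvtserver -quiet | awk 'NF == 1 {print $1}'`
   do
       vos remsite pvtserver a $vol
   done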
Andrew Mickish wrote:
> > What is the procedure for removing a file server from the cell?
> > Can you turn it off, then later tell the cell to forget about it,
> > or do you have to interact with the machine to get it out of the cell?
> >
> > We have a Solaris 5.3 machine that was intended as a test platform when
> > we first set up the cell, but it has developed problems (independent of
> > AFS) that require rebuilding. It has AFS volumes, but refuses to serve
> > them. Before wiping its disk, I thought it would be nice to officially say
> > goodbye.
> > Here is some of its nasty behavior (it is named pvtserver):
> >
> > # fs checkservers
> > These servers unavailable due to network or server problems:
> > pvtserver.
> >
> > # vos examine sun4m_53
> > Could not fetch the information about volume 536870957 from the server
> > Possible communication failure
> > Error in vos examine command. Possible communication failure
> > Dump only information from VLDB
> > sun4m_53
> > RWrite: 536870957
> > number of sites -> 1
> > server pvtserver partition /vicepa RW Site
> >
> > # vos listvol pvtserver
> > Could not fetch the list of partitions from the server
> > Possible communication failure
> > Error in vos listvol command. Possible communication failure
> >
> > # bos status pvtserver
> > Instance upserver, currently running normally.
> > Instance upclientetc, temporarily disabled, stopped for too many
> > errors, currently shutdown.
> > Instance runntp, currently running normally.
> > Instance fs, has core file, currently running normally.
> > Auxiliary status is: salvaging file system.
> >
> >
> > Any insights about what condition this machine appears to be in would be
> > appreciated.
All of what you just quoted can be explained by the fact that the
salvager is currently running, which means the fileserver and volserver
are not. Thus, no volumes are served, "vos" commands that need to talk
to that server won't work, and so on. However, since you implied
that it's an ongoing problem, the next question to ask is why? It
might be interesting to see the output of "bos status pvtserver fs -long",
which will show you which server process last exited, and why. The
date of that core file might also be interesting; it's almost certainly
/usr/afs/logs/core.file.fs.
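Concretely, something like this (the ls has to run on pvtserver itself,
since that's where the BOS server leaves core files):

   # bos status pvtserver fs -long
   # ls -l /usr/afs/logs/core.file.fs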
It would also be instructive to look at the FileLog and wherever (if
anywhere) you log the fileserver's stdout/stderr (we redirect the
output of bosserver on startup to /usr/afs/logs/MiscLog, so as to be
sure to catch any such output). If some piece of metadata were
corrupted, the fileserver might fail an assertion, which would result
in a message printed to stderr followed by an exit on a signal.
If you had source to the version of the fileserver you were running, you
could see what failed in such a case...
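You can also fetch those logs remotely with bos getlog (a relative name
like FileLog is interpreted under /usr/afs/logs; MiscLog will only exist
if you do the same sort of redirection we do):

   # bos getlog pvtserver FileLog
   # bos getlog pvtserver MiscLog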
In any event, if the problem is data corruption, it might be reasonable
to simply wipe your vice partitions and start over. In that case, you
could re-newfs those partitions, and then make sure the VLDB is updated
to reflect what you've done.
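If you go that route, a rough sketch of the mechanics (the disk device
here is made up; use whatever actually backs /vicepa, and this assumes
/vicepa is in /etc/vfstab so a bare mount works):

   (on pvtserver)
   # bos shutdown pvtserver fs -wait
   # umount /vicepa
   # newfs /dev/rdsk/c0t3d0s6
   # mount /vicepa
   # bos restart pvtserver fs

followed by the vos remsite / vos delentry cleanup described above, so
nothing in the VLDB points at volumes that no longer exist.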
-- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
Systems Programmer, CMU SCS Research Facility
Please send requests and problem reports to [EMAIL PROTECTED]