Re: [Q] Why salvaging server occurs frequently??

Kwon Oh-hoon Wed, 25 Mar 1998 13:52:00 +0100 (MET)
> 
> You wrote:
> > From: Kwon Oh-hoon <[EMAIL PROTECTED]>
> > Message-Id: <[EMAIL PROTECTED]>
> > Subject: [Q] Why salvaging server occurs frequently??
> > To: [EMAIL PROTECTED]
> > Date: Wed, 25 Mar 1998 13:58:15 +0000 (KST)
> > Content-Type: text/plain; charset=EUC-KR
> > Sender: [EMAIL PROTECTED]
> > 
> > 
> >     We have three database servers on alpha_osf32 platforms.
> >     The AFS product version on these servers is afs3.4 5.38.
> >     Our three database servers are also file servers.
> >     Because the file server on the DB servers is salvaged so often,
> >     all users in our cell have to stop work almost every day.
> > 
> >     In the FileLog.old file, I found the error message "file assertion failed".
> >     To solve this problem, we upgraded our database servers from afs3.4 4.35
> >     to afs3.4 5.38, but the error occurred again.
> > 
> >     I think this problem began after we started using the backup command
> >     "vos backupsys" for the daily backup of all volumes.
> > 
> >     Log files are in ftp.transarc.com:/pub/afsps/ftp/pohang-univ :
> >     FileLog, SalvageLog, FileLog.old, SalvageLog.old, core.file.fs
> > 
> >     Question 1) Why is the file server salvaged so frequently in this case?
> >         How can this error "file assertion failed" be solved?
> >     Question 2) May a /vicepx/V0xxxxxxx.vol file be removed manually?
> >         The volume is not in the VLDB and is not removed by the command "vos zap".
> 
> I assume you mean the files under:
>       /afs/transarc.com/public/anon-ftp/pub/afsps/ftp/pohang-univ
> As you noted, the important message (why it failed) is:
>       Assertion failed! file afsfileprocs.c, line 6016.
> To really be sure what this means, it's necessary to contact your
> transarc customer support representative.  Assuming, however, that
> the build for "afs 3.4 5.38" contains this ident line in "fileserver":
>       $Header: /afs/transarc.com/project/fs/dev/afs/3.4/.stage13/rcs/viced/RCS/afsfileprocs.c,v 2.453 1997/09/26 19:08:18 chengjie Exp $
> then the assertion on line 6016 fires in the routine CopyOnWrite upon
> any read error, or upon any write error other than ENOSPC.  When this assertion
> happens, you should also have a core file for the "fileserver" process.
> The core dump will probably be named
>       /usr/afs/logs/core.file.fs - or some such.
> You should probably rename it to something else before studying it; otherwise,
> it could be overwritten by another core dump.  You can look at it with
> your favorite debugger (say, adb), with something like:
>       # adb /usr/afs/bin/fileserver /usr/afs/logs/core.file.fs
>       errno/D
>       $c
> If errno was set by the read or write, then it is likely to be useful
> in terms of telling what the problem is.  The $c will tell you where
> the assertion was that failed.  If you don't see CopyOnWrite, then
> that may mean that some other assertion failed, and you will need to ask
> transarc for more clues about what went wrong.  With some patience,
> it is also possible to determine what disk, and what volume were being
> updated, but you'll really want to have transarc do this for you.
> You can facilitate this by saving a copy of your core dump & the
> corresponding fileserver binary, somewhere where your transarc customer
> service representative can look at it.
> 
> A likely cause is a disk error.  In this case, you should find that errno
> is set to EIO.  This will not be the only clue that there are problems.
> You should also find that there are messages on the console about disk
> read and write errors, and these messages should also be recorded in some
> file on the system (often /var/adm/messages, but check to be sure.)
> These messages should include the name of the disk that was failing,
> and the block number.  If you do find these, it's well worth your while
> to fix this as soon as possible, before you lose much data and time.
> A simple way that will find many disk errors is to use "dd" from the raw
> or block device, to /dev/null.  Any errors before the end of the disk
> are cause for alarm (an error at the *end* of the disk is acceptable; some
> Unix disk drivers return an error instead of EOF when this condition is
> hit).  Sometimes, your system will also come with a disk diagnostic aid
> that can format the disk; fancier versions may contain additional tests
> such as a non-destructive sequential read, or a random seek read, or
> some sort of write/read surface certification routine.  It is not a bad
> idea to run a write/read surface certification routine for a day or so
> before putting a new disk into service.   Be careful -- some of those
> tests may erase data on the disk.
> 
>                               -Marcus Watts
>                               UM ITD PD&D Umich Systems Group
> 
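
The "dd to /dev/null" check described above can also be sketched in C, which makes it easy to see the errno and the offset at which a read fails. This is only a rough sketch: the function name scan_device and the 64 KB buffer size are assumptions made for illustration, not part of any AFS or vendor tool.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `path` from start to end and discard the data,
 * much like "dd if=<device> of=/dev/null".
 * Returns 0 on a clean end-of-file, or the errno of the first failure.
 * *bytes_read receives the number of bytes read before stopping, which
 * gives a rough idea of where on the device the problem is.
 */
int scan_device(const char *path, long long *bytes_read)
{
    static char buf[64 * 1024];   /* buffer size is an arbitrary choice */
    ssize_t n;
    int fd;
    int err = 0;

    assert(path != NULL && bytes_read != NULL);
    *bytes_read = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;             /* could not even open the device */

    while ((n = read(fd, buf, sizeof buf)) > 0)
        *bytes_read += n;
    if (n < 0)
        err = errno;              /* EIO here would fit the disk-error theory */

    close(fd);
    return err;
}
```

Called with the raw device path for the suspect partition, a nonzero return well before the end of the device would be the cause for alarm mentioned above. As noted there, some drivers report an error rather than EOF at the very end of the disk.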

I ran the debugger as you suggested, but I could not get errno.
The result is as follows.
 
h2o:root 107 # gdb /usr/afs/bin/fileserver /usr/afs/logs.back.Mar.23.22/core.file.fs
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (alpha-dec-osf3.2), Copyright 1996 Free Software Foundation, Inc...
Core was generated by `fileserver'.
Program terminated with signal 6, IOT/Abort trap.
Reading symbols from /usr/shlib/libc.so...done.
#0  0x3ff801072d8 in __kill ()
    at ../../../../../src/usr/ccs/lib/libc/alpha/kill.s:41
../../../../../src/usr/ccs/lib/libc/alpha/kill.s:41: No such file or directory.
(gdb)

And what is the file assertion?
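
For reference, here is a minimal sketch in C of the kind of check described in the quoted reply: an assertion that fails on any read error, or on any write error other than ENOSPC. The names io_result_ok and check_io are hypothetical, and the sketch uses the standard assert() macro. The real CopyOnWrite source is not shown in this thread, and the fileserver may use its own assertion macro.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical predicate modelled on the description of CopyOnWrite above.
 * Returns 1 if the I/O result is tolerated, 0 if it would trip the assertion.
 *   is_write : nonzero for a write, zero for a read
 *   wanted   : number of bytes requested
 *   got      : number of bytes actually transferred (-1 on error)
 *   err      : errno value observed after the call
 */
int io_result_ok(int is_write, long wanted, long got, int err)
{
    if (got == wanted)
        return 1;                 /* full transfer, nothing to complain about */
    if (is_write && err == ENOSPC)
        return 1;                 /* a full disk on write is handled, not fatal */
    return 0;                     /* any other read or write failure */
}

/*
 * When the condition is false, assert() reports the source file and line
 * and then calls abort(), which raises SIGABRT (signal 6) and normally
 * leaves a core file behind.
 */
void check_io(int is_write, long wanted, long got, int err)
{
    assert(io_result_ok(is_write, wanted, got, err));
}
```

A failure path like this would be consistent with the gdb output above, where the process ended with signal 6 and the innermost frame is __kill. In gdb, `bt` plays the role of adb's `$c`, and `print errno` that of `errno/D`, assuming errno is visible as an ordinary global symbol in that binary.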

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Kwon O-Hoon (권 오훈)
POSTECH Computing Center Researcher 
Personal E-Mail : [EMAIL PROTECTED]
Official E-Mail : [EMAIL PROTECTED], [EMAIL PROTECTED]
Homepage : http://www.postech.ac.kr/~dolphin 
Telephone : +82-562-279-2540
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=