[EMAIL PROTECTED] writes:
> The second corrupted volume, cs.usr0.mbell, 536877058, exhibited the
> same behavior as all of the other CopyOnWrite failures.  A log of
> what I did this morning is in ftp://ftp.cs.pitt.edu/hoffman/openafs/mbell.log.
> 
> The affected fileserver is running RedHat 7.2, kernel 2.4.9-21 and
> OpenAFS 1.2.3, non-threaded fileserver with the ihandle.c patch.
> 
> What should I try next?
> 
>       ---Bob.

I'm assuming that you have source to all the bits, and are prepared
to tackle this as a software developer, and want only clues as to which
avenues to investigate first.  If this doesn't describe you, the
following may be of less value to you.

The fileserver process is most likely seeing "EIO" from the kernel.  If
it didn't come from the hardware, then it's got to be a software
thing.  Several approaches:

(1) grep for EIO in the kernel source and try to figure out
        why the fileserver might get this error (i.e., dig
        through the kernel filesys layer, syscall interface,
        etc.).  For kernel filesys code, this error *should*
        only indicate a hardware failure, but the Linux developers
        (or POSIX standards, or yada yada...) don't necessarily
        share that belief, and you may well find a code path
        where that error really means "invalid parameter" (which
        "should" be EINVAL), authorization failure (EPERM or
        EACCES), etc.  There's a made-up illustration of the
        pattern to look for after this list.
(2) do a "strace" on the fileserver process, poke at the
        volume, see if you can get get a record of
        the failing syscall & parameters.  Also try strace
        on anything else that touches the disk and exercises the bug.
(3) The AFS source at least used to come with standalone utilities
        that would poke at the filesystem directly, using
        the same interface the fileserver uses.  There might
        be a self-contained voldump utility, or one that reads an
        arbitrary inode, or some such.  Perhaps running those will
        generate interesting clues, or better yet, offer a smaller
        self-contained way to exercise the bug.
(4) try doing a "vos dump" for the affected volume, see what you get,
        both in terms of errors from volserver, & in terms of
        what's actually in the dump.  This won't destroy any data,
        so is definitely a simple diagnostic.
(5) try running the salvager on the affected volume(s) and see
        what the salvager sees.  Be prepared to restore the volume
        from tape; on the other hand, if the salvager eats the
        volume, you were probably going to have to do that anyway.
        This will likely destroy data, so it may destroy evidence
        of the bug.  You'll want to collect read-only evidence
        first before trying this.
(6) If strace doesn't show EIO coming back from kernel-land,
        then another avenue to investigate is libc -- is
        there something in there that could be returning EIO
        to the fileserver?  (There shouldn't be, not for this,
        but you never know...)  Also check for anything in
        the AFS libraries proper that might just happen to
        return this.  The probe sketched below can help rule
        libc in or out.

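Since a couple of these are easier to show than to describe, two rough
sketches follow.  Neither is code from the kernel or the AFS tree; the
names in them are invented purely for illustration.

For (1), the pattern worth grepping for looks roughly like this.
Kernel code hands errors back as negative errno values, so "return
-EIO" is the string to chase; the smell is a sanity check that reports
EIO even though nothing failed at the hardware level.  This is a
user-space rendition (example_check and EXAMPLE_MAX_LEN are made up)
just so it compiles and runs on its own:

    /* illustrative only -- not from any real kernel file */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    #define EXAMPLE_MAX_LEN 4096    /* stand-in for some filesystem limit */

    /* kernel-style helper: 0 on success, negative errno on failure */
    static int example_check(long offset, long len)
    {
        if (offset < 0 || len > EXAMPLE_MAX_LEN)
            return -EIO;        /* "should" be -EINVAL; this is the smell */
        return 0;
    }

    int main(void)
    {
        int rc = example_check(0, 2 * EXAMPLE_MAX_LEN);
        if (rc < 0)
            printf("caller sees errno %d (%s) for a bad parameter\n",
                   -rc, strerror(-rc));
        return 0;
    }
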
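For (2) and (6), here's a minimal probe, again just a sketch: it reads
a file with plain open()/read() and reports any errno by name.
Assuming this is the namei fileserver (so volume data lives in
ordinary files under the vice partition), pointing it at one of the
bad volume's data files exercises much the same syscall path the
fileserver does, without libc's stdio or the AFS libraries in the way.
The path below is only a placeholder; substitute a real file under the
affected /vicepX.  If this sees EIO too, the kernel really is handing
it back; if it doesn't, start suspecting the layers above.

    /* errno-probe.c -- read a file with raw open()/read(), report errno */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[8192];
        /* placeholder path -- pass a real file on the bad partition */
        const char *path = (argc > 1) ? argv[1] : "/vicepa/some-volume-file";
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "open(%s): errno %d (%s)\n",
                    path, errno, strerror(errno));
            return 1;
        }
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n < 0) {
                /* EIO here means the kernel itself returned it;
                 * EINVAL/EPERM/EACCES would point somewhere else */
                fprintf(stderr, "read(%s): errno %d (%s)\n",
                        path, errno, strerror(errno));
                close(fd);
                return 1;
            }
            if (n == 0)
                break;          /* clean EOF */
        }
        close(fd);
        printf("%s read cleanly\n", path);
        return 0;
    }

Running it under strace at the same time gives you the failing syscall
and its arguments in one shot.
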
                                -Marcus Watts
                                UM ITCS Umich Systems Group
