On Thu, 2009-11-12 at 12:32 -0500, Erik Garrison wrote:
> I didn't catch any errors in syslog.
> 
> I really should have gotten that strace.  I was sick when I
> finalized the recovery so perhaps this is why it slipped my mind.
> 
> Is there any way that the system could both copy the bad (corrupting)
> data and still raise the error?  If the ESTALE error isn't handled
> properly up the stack then perhaps the corrupted inodes could be
> copied.

It's possible that a bad error path in jfs could corrupt an in-memory
data structure that could lead to errors on the target side.  Since the
error paths aren't exercised very often, there could easily be some
latent errors.  (I haven't been diligent about following up on "fuzzer"
bugs that do exercise the error paths.)

Copying bad file data wouldn't affect the file system's integrity, and
there is no way of copying bad metadata since the copied files are
created by normal system calls, and the metadata is re-created from
scratch.
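
To illustrate the point about copies going through normal system calls,
here is a minimal user-space sketch of what a tool like cp does for a
single regular file.  The helper name copy_file_contents is hypothetical
and this is not how cp -a is actually implemented (it also handles
ownership, timestamps, directories, and so on).  The point is only that
the destination inode is created by open() on the target file system,
and that a read error on the source is reported to the caller rather
than written to the target.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the bytes of src into a newly created dst.
 * Returns 0 on success or a negative errno value on failure. */
int copy_file_contents(const char *src, const char *dst)
{
    char buf[65536];
    int ret = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -errno;

    /* The target file system allocates and initializes a fresh inode
     * here; no on-disk metadata is carried over from the source. */
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        ret = -errno;
        close(in);
        return ret;
    }

    for (;;) {
        ssize_t n = read(in, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            ret = -errno;   /* e.g. an error from a damaged source */
            break;
        }
        if (n == 0)
            break;          /* end of file */

        /* write() may be partial, so loop until the chunk is out. */
        char *p = buf;
        while (n > 0) {
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                ret = -errno;
                break;
            }
            p += w;
            n -= w;
        }
        if (ret)
            break;
    }

    if (close(out) < 0 && ret == 0)
        ret = -errno;
    close(in);
    return ret;
}
```

Only file contents and a pathname cross from source to target in this
model, which is why bad source data alone should not be able to damage
the target's own structures.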

> This seems to have been a problem for ext4:
> e6f009b0b45220c004672d41a58865e94946104d
>     ext4: return -EIO not -ESTALE on directory traversal through deleted inode
> 
> For convenience here's the full commit:
> 
> commit e6f009b0b45220c004672d41a58865e94946104d
> Author: Bryan Donlan <[email protected]>
> Date:   Sun Feb 22 21:20:25 2009 -0500
> 
>     ext4: return -EIO not -ESTALE on directory traversal through deleted inode
> 
>     ext4_iget() returns -ESTALE if invoked on a deleted inode, in order to
>     report errors to NFS properly.  However, in ext4_lookup(), this
>     -ESTALE can be propagated to userspace if the filesystem is corrupted
>     such that a directory entry references a deleted inode.  This leads to
>     a misleading error message - "Stale NFS file handle" - and confusion
>     on the part of the admin.
> 
>     The bug can be easily reproduced by creating a new filesystem, making
>     a link to an unused inode using debugfs, then mounting and attempting
>     to ls -l said link.
> 
>     This patch thus changes ext4_lookup to return -EIO if it receives
>     -ESTALE from ext4_iget(), as ext4 does for other filesystem metadata
>     corruption; and also invokes the appropriate ext*_error functions when
>     this case is detected.
> 
> 
> I have adapted this patch to the jfs case
> 
> diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
> index c79a427..0bbd489 100644
> --- a/fs/jfs/namei.c
> +++ b/fs/jfs/namei.c
> @@ -1471,9 +1471,15 @@ static struct dentry *jfs_lookup(struct inode *dip, struct dentry *dentry,
>         }
> 
>         ip = jfs_iget(dip->i_sb, inum);
> -       if (IS_ERR(ip)) {
> -               jfs_err("jfs_lookup: iget failed on inum %d", (uint) inum);
> -               return ERR_CAST(ip);
> +       if (unlikely(IS_ERR(ip))) {
> +               if (PTR_ERR(ip) == -ESTALE) {
> +                       jfs_err("deleted inode referenced: %u",
> +                                       (uint) inum);
> +                       return ERR_PTR(-EIO);
> +               } else {
> +                       jfs_err("jfs_lookup: iget failed on inum %d", (uint) inum);
> +                       return ERR_CAST(ip);
> +               }
>         }
> 
>         dentry = d_splice_alias(ip, dentry);
> 
> 
> Testing will require:   "creating a new filesystem, making a link to
> an unused inode using debugfs, then mounting and attempting to ls -l
> said link."
> 
> Where should I submit the patch after I've tested it?

This is the right place for that.  I appreciate the patch and your
testing.
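
For readers unfamiliar with the kernel's error-pointer convention, the
translation in the quoted patch can be modeled in user space roughly as
follows.  This is only a sketch: err_ptr, ptr_err, is_err, and
map_stale_to_eio are hypothetical stand-ins that imitate the kernel's
ERR_PTR, PTR_ERR, and IS_ERR helpers; they are not kernel or jfs code.

```c
#include <errno.h>
#include <stdint.h>

/* Small negative errno values are encoded in the top of the address
 * range, which no valid pointer is assumed to occupy. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Translate -ESTALE into -EIO; pass every other value through. */
void *map_stale_to_eio(void *ip)
{
    if (is_err(ip) && ptr_err(ip) == -ESTALE)
        return err_ptr(-EIO);
    return ip;
}
```

A valid pointer and any error other than -ESTALE come back unchanged,
so only the "directory entry references a deleted inode" case is
reported to the caller as an I/O error.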

Thanks,
Shaggy

> Erik
> 
> On Wed, Nov 11, 2009 at 8:13 PM, Dave Kleikamp
> <[email protected]> wrote:
> > On Wed, 2009-11-11 at 22:11 +0100, Andi Kleen wrote:
> >> Erik Garrison <[email protected]> writes:
> >> >
> >> > I removed the bad ram and began efforts to recover the system.  I then
> >> > booted the system using an Ubuntu Karmic live CD and tried to back up
> >> > the data via a simple cp -a <src> <dest>.  This failed upon reaching
> >> > one of the corrupted files, and additionally left the target (also
> >> > JFS) filesystem damaged.  I had to reformat the target filesystem and
> >> > try again.
> >>
> >> Leaving the target damaged too when another file system threw an error
> >> sounds like a serious bug. Are you sure the new hardware was good?
> >
> > I had skimmed over this too quickly and missed that.  Yeah.  That
> > shouldn't happen.  Were there any I/O errors in the syslog?
> >
> > Thanks,
> > Shaggy
> > --
> > David Kleikamp
> > IBM Linux Technology Center
> >
> >
-- 
David Kleikamp
IBM Linux Technology Center


_______________________________________________
Jfs-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
