On Thu, 2007-02-08 at 01:09 -0500, [EMAIL PROTECTED] wrote:
> I just noticed a peculiar problem, which has happened at least twice
> in the last 10 weeks---a file that -was- in the filesystem apparently
> vanished, then reappeared.  I'm looking for advice on how to debug
> this or research it.  [Also, see PS for a slightly different problem
> that might or might not be related.]
> 
> I'm using the JFS that comes out of the box with Ubuntu Breezy, which
> is kernel 2.6.12-10-386.  The filesystem is in a normal partition, NOT
> mounted over LVM, and it's not the root filesystem.  It's exported rw
> via NFS, but both times a file vanished & reappeared, nothing but the
> local host was accessing that filesystem (and that had been true for
> many minutes or hours; only one other host can touch that filesystem
> anyway).
> 
> I noticed the problem because I have some automation that's actually a
> large tcsh script.  It always runs with echo & verbose, and logs all
> its output to a file.  It tried to remove a file (call it FOO) simply
> via
>   "rm /blah/biff/FOO"
> and got
>   rm: cannot remove `/blah/biff/FOO': No such file or directory
> 
> I noticed the problem several days later, but FOO is actually there.
> To ensure I wasn't misreading something, I marked the entire pathname
> (the actual rm used a rooted pathname, e.g., "rm /blah/biff/FOO") in
> Emacs and, in a shell buffer, did "ls -alF /blah/biff/FOO" by yanking
> the marked pathname.  It shows the file.  "head -c 5 /blah/biff/FOO"
> gives me the first five bytes.  Thus, I'm 100% sure that the pathname
> the script tried to use and failed with, and the pathname I'm now
> using, are the same pathname.

I guess you know the situation well enough to rule out things like a
process recreating a same-named file at some later point, or some
component of the path being renamed temporarily and then renamed back.

> So I started reviewing my logs of that script and found another
> instance of the same problem from late November, on a different file.
> That file has since been deleted in some other way (I probably noticed
> it hanging around and nuked it without realizing why it was still
> there).  The most recent file, OTOH, is still there.  (It's large and
> I need the space, so I may delete it soon anyway, in the hopes that
> I'm not about to destroy debugging data.)

true > /blah/biff/FOO should give you the space back without removing
the file itself: it truncates FOO to zero length but leaves its
directory entry and inode in place.
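
A minimal sketch of what that truncation does, run against a scratch
file rather than the real /blah/biff/FOO (the mktemp path and the 1 MiB
size are assumptions for illustration only).  The inode number should be
the same before and after, while the size drops to zero:

```shell
#!/bin/sh
# Scratch file standing in for /blah/biff/FOO (hypothetical path from mktemp).
f=$(mktemp)

# Fill it with 1 MiB of zeros so there is something to free.
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null

inode_before=$(ls -i "$f" | awk '{print $1}')
size_before=$(wc -c < "$f" | tr -d ' ')

# Truncate in place: the redirection reopens the file with truncation,
# and `true` writes nothing into it.
true > "$f"

inode_after=$(ls -i "$f" | awk '{print $1}')
size_after=$(wc -c < "$f" | tr -d ' ')

echo "inode: $inode_before -> $inode_after"
echo "size:  $size_before -> $size_after"

rm -f "$f"
```

Note that the contents are gone afterwards; only the name and inode
survive.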

> These files get created by a script running on the machine with the
> JFS filesystem that pulls them via NFS from another machine (also
> running JFS) via cp.  Since it's a pull, I'm reasonably sure that NFS
> can't be responsible, and since it's cp, you'd think there'd be
> nothing unusual going on, and certainly the files don't seem to have
> any problems getting -used-: it's just that twice in the last 1000
> files or so, JFS has claimed a file wasn't there at the instant that
> rm tried to delete it.

I would agree that NFS probably isn't a factor.  Again, it's not
possible that the script at some later point replaced the missing file,
is it?
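
One way to help answer that in future runs is to have the script log
the file's inode number and timestamps just before the rm, so that a
later look at the file can tell whether it is the same inode.  A minimal
sketch, assuming GNU coreutils stat (for the -c format option); the
helper name file_identity is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print inode number, mtime and ctime (seconds since
# the epoch) for a path, or "missing" if the path does not exist.
# Assumes GNU coreutils stat, which supports -c FORMAT.
file_identity() {
    if [ -e "$1" ]; then
        stat -c 'inode=%i mtime=%Y ctime=%Z' "$1"
    else
        echo "missing"
    fi
}

# Example use in the script, just before the rm:
#   file_identity /blah/biff/FOO
```

A ctime later than the failed rm would only show that the inode changed
after it, not by itself that the file was recreated, so treat the output
as a hint rather than proof.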

> I have -not- tried running "fsck.jfs -n -v" yet, nor am I sure that's
> even the thing to try.  I may try that in a few hours, when I have a
> window during which I can unmount it for a few minutes.

Remounting it read-only (mount -o remount,ro) may be a more palatable
option than unmounting it.

> Given that I
> haven't noticed any -other- problems with this FS, I'm loathe to try a
> non -n version of fsck.

I don't blame you, unless running with -n shows a need to repair
anything.  If it does, I'd save a copy of any files or directories it
complains about before running without -n.

> (Although note that this machine gets booted
> every few days on average anyway, so the normal fsck on boot has been
> run fairly frequently.)

Normally, this would just replay the journal.  It doesn't do a full
check unless the file system is marked dirty or replaying the journal
fails.

>   I can't swear that a reboot didn't happen
> between the file's creation, use, and subsequent deletion attempt,
> but I could research that if it's important.  The machine has
> certainly been rebooted since the failed deletion (about a week
> ago) and today, when I noticed that the file was there when it
> shouldn't have been, checked my logs, and found the rm failure.
> 
> Grepping /var/log/* for "-i jfs" yields pretty much only
>   JFS: nTxBlock = 8098, nTxLock = 64790
> in kern.log from when the machine comes up.

Not every error has "jfs" in it.  Some errors contain the block device
name, such as sdaN or hdaN, so you may want to grep for those too.
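
A minimal sketch of such a search; the helper name scan_logs and the
device name hda5 in the usage line are hypothetical, so substitute the
partition that actually holds the file system:

```shell
#!/bin/sh
# Hypothetical helper: search log files for lines that mention either
# "jfs" or the given block device name, ignoring case.
# Usage: scan_logs DEVICE FILE...
scan_logs() {
    dev=$1
    shift
    grep -i -E -e "jfs|$dev" "$@"
}

# Example (device name is an assumption):
#   scan_logs hda5 /var/log/kern.log /var/log/syslog
```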

> My /etc/fstab
> does have a nonzero "pass" for this filesystem, so I'm reasonably
> sure fsck is running.

I'm sure it is, or you wouldn't be able to remount it r/w after a
non-clean shutdown.

> Has anyone seen anything like this?

Not that I recall.

> P.S.  This machine occasionally has to have its reset button pushed
> due to issues with a PCI card that hangs.  (The machine itself very
> rarely hangs even when that happens, but sometimes an attempt to
> reboot it causes the shutdown to wedge partway through, and -that-
> requires hitting the reset button.  It runs headless, so I can't
> easily tell exactly where the shutdown hangs.) Some small fraction of
> the time that -that- happens, I notice that a large file in its JFS
> that -had- been around for many minutes (long enough to get written to
> a DVD -and- to have cmp compare the entire file to what got written to
> the DVD) has vanished when the machine came back up.  I find this
> quite surprising and worrisome---surely syncs are happening every few
> seconds, or at the very least every few minutes, and yet the file is
> simply -gone- after the reset.

By default pdflush writes out any data that has been dirty for more
than 30 seconds; it wakes up every 5 seconds to look for such data.  You
can check /proc/sys/vm/dirty_expire_centisecs (default 3000) to see if
it's changed.
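
A small sketch for reading the setting in seconds.  It also prints
dirty_writeback_centisecs, the pdflush wakeup interval, which is not
mentioned above but lives in the same directory; the helper name
centisecs_to_secs is hypothetical:

```shell
#!/bin/sh
# Convert a value in centiseconds (the unit used by the vm.dirty_*
# timers) to whole seconds.
centisecs_to_secs() {
    echo $(( $1 / 100 ))
}

# Print both writeback timers if the kernel exposes them.
for name in dirty_expire_centisecs dirty_writeback_centisecs; do
    path=/proc/sys/vm/$name
    if [ -r "$path" ]; then
        value=$(cat "$path")
        echo "$name = $value ($(centisecs_to_secs "$value") s)"
    fi
done
```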

It's possible that the hang is preventing the file system from writing
to the disk.  The next time it happens, "echo t > /proc/sysrq-trigger"
will dump every task's stack trace to the kernel log.  (Newer kernels
also have "echo w", which shows only the blocked tasks.)

> I've now taken to having my automation
> actually call sync when it's done, so a subsequent reset (which never
> happens in the middle of the script---I can and do always wait until
> the script finishes before either attempting to reboot or hitting the
> reset button) is guaranteed to happen after a sync.

I interpret this to mean it hasn't happened since you added the sync.
If my suspicion is correct that the file system is hung up, the sync
will never complete.

> I'm now keeping
> an eye on this to see if this ever recurs---but do I misunderstand how
> often JFS actually manages to commit its metadata?  [I can't swear
> that the vanishing file was always after a reset instead of a
> controlled shutdown/reboot, but I'm guessing that it was.]

It shouldn't take much longer than 30 seconds for anything to be
committed to the journal.  If the hang is not affecting the file system,
there may be a bug to track down.  I looked through the fixes that have
been applied to jfs since 2.6.12 and didn't see anything that I think
would have fixed it.  (If the hang is in fact due to a bug in jfs, there
might be a fix available, but it sounds like you know the source of the
hang.)

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


_______________________________________________
Jfs-discussion mailing list
Jfs-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
