I just noticed a peculiar problem, which has happened at least twice
in the last 10 weeks---a file that -was- in the filesystem apparently
vanished, then reappeared.  I'm looking for advice on how to debug
this or research it.  [Also, see PS for a slightly different problem
that might or might not be related.]

I'm using the JFS that comes out of the box with Ubuntu Breezy, which
is kernel 2.6.12-10-386.  The filesystem is in a normal partition, NOT
mounted over LVM, and it's not the root filesystem.  It's exported rw
via NFS, but both times a file vanished & reappeared, nothing but the
local host was accessing that filesystem (and that had been true for
many minutes or hours; only one other host can touch that filesystem
anyway).

I noticed the problem because I have some automation that's actually a
large tcsh script.  It always runs with echo & verbose, and logs all
its output to a file.  It tried to remove a file (call it FOO) simply
via
  "rm /blah/biff/FOO"
and got
  rm: cannot remove `/blah/biff/FOO': No such file or directory

I noticed the failure only several days later, and FOO is in fact
still there.  To ensure I wasn't misreading something, I marked the
entire pathname from the log (the actual rm used a rooted pathname,
e.g., "rm /blah/biff/FOO") in
Emacs and, in a shell buffer, did "ls -alF /blah/biff/FOO" by yanking
the marked pathname.  It shows the file.  "head -c 5 /blah/biff/FOO"
gives me the first five bytes.  Thus, I'm 100% sure that the pathname
the script tried to use and failed with, and the pathname I'm now
using, are the same pathname.

So I started reviewing my logs of that script and found another
instance of the same problem from late November, on a different file.
That file has since been deleted in some other way (I probably noticed
it hanging around and nuked it without realizing why it was still
there).  The most recent file, OTOH, is still there.  (It's large and
I need the space, so I may delete it soon anyway, in the hopes that
I'm not about to destroy debugging data.)

These files get created by a script running on the machine with the
JFS filesystem that pulls them via NFS from another machine (also
running JFS) via cp.  Since it's a pull, I'm reasonably sure that NFS
can't be responsible, and since it's cp, you'd think there'd be
nothing unusual going on, and certainly the files don't seem to have
any problems getting -used-: it's just that twice in the last 1000
files or so, JFS has claimed a file wasn't there at the instant that
rm tried to delete it.
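
One way to capture more evidence the next time this happens might be
to have the automation record what the filesystem reports at the
moment rm fails.  The following is only a sketch in plain sh (the
script described above is tcsh, so it would need adapting); the
function name rm_logged is hypothetical and not part of the existing
script.

```shell
# Hypothetical wrapper (not part of the original script): remove a file
# and, if rm fails, immediately record what the filesystem reports for
# the file and for its parent directory.
rm_logged() {
    target=$1
    status=0
    rm -- "$target" || status=$?
    if [ "$status" -ne 0 ]; then
        {
            echo "rm_logged: rm exited $status for $target at $(date)"
            # Does a fresh lookup of the same pathname succeed right now?
            ls -ld -- "$target" || true
            # Inode number and timestamps of the containing directory.
            ls -ldi -- "$(dirname -- "$target")" || true
        } >&2
    fi
    return "$status"
}
```

Called as "rm_logged /blah/biff/FOO", this would show whether a second
lookup of the same pathname, a moment after the failed rm, also fails
or already succeeds.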

I have -not- tried running "fsck.jfs -n -v" yet, nor am I sure that's
even the thing to try.  I may try that in a few hours, when I have a
window during which I can unmount it for a few minutes.  Given that I
haven't noticed any -other- problems with this FS, I'm loath to try a
non -n version of fsck.  (Although note that this machine gets booted
every few days on average anyway, so the normal fsck on boot has been
run fairly frequently.)  I can't swear that a reboot didn't happen
between the file's creation, use, and subsequent deletion attempt,
but I could research that if it's important.  The machine has
certainly been rebooted between the failed deletion (about a week
ago) and today, when I noticed that the file was there when it
shouldn't have been, checked my logs, and found the rm failure.

Running "grep -i jfs" over /var/log/* yields pretty much only
  JFS: nTxBlock = 8098, nTxLock = 64790
in kern.log from when the machine comes up.  My /etc/fstab
does have a nonzero "pass" for this filesystem, so I'm reasonably
sure fsck is running.
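
For anyone wanting to repeat the fstab check, a small sketch follows.
It assumes the standard six-field fstab layout, in which the sixth
field is the fsck pass number and a missing field counts as 0; the
function name jfs_fsck_pass is made up for illustration.

```shell
# Hypothetical helper: print device, mount point, and fsck pass number
# for every JFS entry in an fstab-format file (default: /etc/fstab).
# A pass number of 0 means the filesystem is not checked at boot.
jfs_fsck_pass() {
    awk '$1 !~ /^#/ && $3 == "jfs" { print $1, $2, ($6 == "" ? 0 : $6) }' \
        "${1:-/etc/fstab}"
}
```

A nonzero number in the last column only shows that a boot-time check
is requested, not that one actually ran or what it reported.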

Has anyone seen anything like this?

P.S.  This machine occasionally has to have its reset button pushed
due to issues with a PCI card that hangs.  (The machine itself very
rarely hangs even when that happens, but sometimes an attempt to
reboot it causes the shutdown to wedge partway through, and -that-
requires hitting the reset button.  It runs headless, so I can't
easily tell exactly where the shutdown hangs.) Some small fraction of
the time that -that- happens, I notice that a large file in its JFS
that -had- been around for many minutes (long enough to get written to
a DVD -and- to have cmp compare the entire file to what got written to
the DVD) has vanished when the machine came back up.  I find this
quite surprising and worrisome---surely syncs are happening every few
seconds, or at the very least every few minutes, and yet the file is
simply -gone- after the reset.  I've now taken to having my automation
actually call sync when it's done, so a subsequent reset (which never
happens in the middle of the script---I can and do always wait until
the script finishes before either attempting to reboot or hitting the
reset button) is guaranteed to happen after a sync.  I'm now keeping
an eye on this to see if this ever recurs---but do I misunderstand how
often JFS actually manages to commit its metadata?  [I can't swear
that the vanishing file was always after a reset instead of a
controlled shutdown/reboot, but I'm guessing that it was.]
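
The "call sync when it's done" step could look roughly like the
sketch below.  It is written in plain sh rather than tcsh, the
function name copy_and_flush is hypothetical, and it is not the
actual automation.  Note that the cmp here most likely reads back
from the page cache, so it checks the copy, not what reached the disk.

```shell
# Hypothetical helper: copy SRC to DST, confirm the copy matches, then
# ask the kernel to flush dirty data and metadata before returning.
copy_and_flush() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" || return 1
    # Byte-for-byte comparison; -s suppresses output, status tells all.
    cmp -s -- "$src" "$dst" || return 2
    # Flushes all filesystems, not just the one holding DST.
    sync
}
```

Even after sync returns, a drive with a volatile write cache may not
have committed everything, so this narrows the window and does not
necessarily close it.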

_______________________________________________
Jfs-discussion mailing list
Jfs-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
