I've had this problem for -years- now, and I'm totally fed up with it.

I've got a machine with an unstable piece of hardware in it.  Every so
often, it locks up, and I have to push the reset button.  With some
low (but far from zero) probability, this causes the last file written
to one of its JFS filesystems to VANISH when the machine comes back up.

Today, this happened with a file written FIVE HOURS before I had to
reset the machine, with NO activity on the JFS since then.

The machine runs headless and automounts these partitions, and
I strongly suspect that what's happening is that fsck runs on the
partition and decides that one file is too damaged to keep, and just
flushes it.  Unfortunately, I have zero logging information from the
fsck available.  (I did, however, see the same behavior once when the
machine refused to mount the JFS and I had to run fsck.jfs by hand, so
it's clear that fsck does this at least occasionally.)

WHY, oh WHY, has JFS not committed all of its journal info HOURS
beforehand (like, within the standard 30-second pdflush interval),
so that the FS doesn't kill the most-recently-written file in this
case?  If ext3 weren't so godawful slow at deleting large files, I'd
dump JFS in a heartbeat over this, since at least ext3 doesn't just
randomly misplace files after a crash.

Is there anything I -can- do to keep it from doing this?  Saying
"just unmount the FS before you push reset" is a nonstarter.  The
resets are occasional, I never know when I'll need one, and by the
time one is necessary the machine is completely hung anyway.  The FS
is also in pretty-constant use, so I'd have to kill a lot of jobs just
to unmount it: NFS, plus the normal programs running on the machine,
which tend to read a few multi-GB files an hour and write a few an
hour as well.

Is there -anything- I can do to improve my odds of NOT having JFS
simply eat a file if I have to reset the machine?
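
(For concreteness, the only application-side measure I can think of is
having the writer explicitly fsync each file, and its directory, once
it's done writing.  Below is a minimal sketch of that idea in Python,
assuming the writing job is something one can modify; durable_write is
a made-up name for illustration.  Whether this actually prevents the
fsck behavior described above is exactly what I don't know.)

```python
import os


def durable_write(path, data):
    """Write bytes to path, then ask the kernel to push them to disk.

    Hypothetical helper: the name and structure are illustrative only.
    """
    directory = os.path.dirname(os.path.abspath(path))

    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # request that file data and metadata hit disk

    # Also fsync the containing directory so the new directory entry
    # itself is flushed, not just the file contents (POSIX/Linux).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

(This trades write throughput for not depending on the periodic
writeback interval, which may matter for multi-GB files.)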

(This is quite an old kernel by now---2.6.12 in Ubuntu---but I can't
upgrade the machine in any reasonable timeframe due to a raft of other
considerations.  And I have no idea if this behavior is well-enough
known that it's even been fixed in some way in a recent kernel anyway;
anyone know?)

Thanks much...
