I've had this problem for -years- now, and I'm totally fed up with it. I've got a machine with an unstable piece of hardware in it. Every so often, it locks up, and I have to push the reset button. With some low (but far from zero) probability, this causes the last file written to one of its JFS filesystems to VANISH when the machine comes back up.
Today, this happened with a file written FIVE HOURS before I had to reset the machine, with NO activity on the JFS since then. The machine runs headless and automounts these partitions, and I strongly suspect that fsck runs on the partition, decides that one file is too damaged to keep, and just discards it. Unfortunately, I have zero logging information from the fsck available. (I did, however, see the same behavior once when the machine refused to mount the JFS and I had to run fsck.jfs by hand, so it's clear that this does occasionally happen.)

WHY, oh WHY, has JFS not committed all of its journal info HOURS beforehand (like, within the standard 30 second pdflush interval) so that the FS doesn't kill the most-recently-written file in this case? If ext3 weren't so godawful slow at deleting large files, I'd dump JFS in a heartbeat over this, since at least ext3 doesn't just randomly misplace files after a crash.

Is there anything I -can- do to keep it from doing this? Saying "just unmount the FS before you push reset" is a nonstarter: the resets are occasional, I never know when I'll need one, and by the time it's necessary the machine is completely hung anyway. The FS is also in pretty constant use, and I'd have to kill a lot of jobs just to unmount it (NFS, plus the normal programs running on the machine, which tend to read a few multi-GB files an hour and write a few an hour as well).

Is there -anything- I can do to improve my odds of NOT having JFS simply eat a file if I have to reset the machine?

(This is quite an old kernel by now, 2.6.12 in Ubuntu, but I can't upgrade the machine in any reasonable timeframe due to a raft of other considerations. And I have no idea whether this behavior is well enough known that it's been fixed in some way in a recent kernel anyway; anyone know?)

Thanks much...
_______________________________________________
Jfs-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
