[ ... ]

>> This and "very busy storing/processing new files (24h/day)"
>> later seem to describe a fairly critical system with somewhat
>> high availability requirements.
[ ... ]

>> Indeed, and if one has availability constraints, relying on
>> 'fsck' being quick is equally unrealistic.

> That's an interesting comment - I guess I _have_ been relying
> on 1) the system only rarely needing a reboot and 2) a fast
> fsck when it happens.

Plenty of people do that, and then the bad news does happen. Some
time ago I was at a workshop on large-scale system administration
at big national research labs (CERN and so on), and I asked almost
every speaker what they were doing about filesystem checking times;
some seemed to be unaware of the issue.

The main driver of the issue is that, thanks to RAID of various
sorts, it is easy to scale up capacity and read or write rates, but
'fsck' does not take advantage of the multiple spindles in a RAID
set because it is serial.

> Do you have any insights to share wrt availability, large
> filesystems (up to 1TB in our case) and millions of files?
> (apart from "don't do it" :-)

From other things you have written it looks like you use the
filesystem as a structured database. "don't do it" :-). Usually it
is better to use a database manager rather than a filesystem if you
want to store many records.

However, filesystems can grow in their own way without remotely
looking like structured databases, for example a 200TB repository
with 100M files. For that, in general the only way I can see now is
a cluster of very many smaller filesystems, each of which can be
either repaired or restored from backup pretty quickly, which means
1-4TB and hundreds of thousands of inodes each. Some notes I have
written on the subject:

  http://www.sabi.co.uk/blog/0804apr.html#080417
  http://www.sabi.co.uk/blog/0804apr.html#080407

[ ... ]

>> The time taken to do a deep check of entangled filesystems can
>> be long. For an 'ext3' filesystem it was 75 days, and there are
>> other interesting reports of long 'fsck' times:

> Uh oh. I guess I'd better move ahead with my new system, and
> hope to migrate whatever I can later on.

In your case, for up to 1TB, the best strategy is probably frequent
backups and then, in case of trouble, a quick restore by copying
back the whole disk with 'dd' (a rough sketch is further below).
With FW800 or eSATA I get around 50MB/s sustained average (better
with O_DIRECT and large block sizes, which I now prefer) when
duplicating modern cheap 500GB drives:

  http://www.sabi.co.uk/blog/0705may.html#070505

Of course if you have a "warm" backup you can just swap in the
backup drive and, if really necessary, do an offline 'fsck' on the
swapped-out damaged filesystem. There are several options, each
advisable depending on circumstances.

>> but I haven't found (even on this mailing list) many reports of
>> 'fsck' durations for JFS, and my own filesystems are rather small
>> like yours (a few hundred thousand files, a few hundred GB of
>> data), and 'fsck' takes a few minutes on undamaged or mostly OK
>> filesystems.

> That has been my experience too - right up until 28 April at around
> 20:00. :-(

That's because even 'jfs_fsck -f' is quite quick on clean
filesystems; the problem is deep scans on messed-up filesystems.
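As to that whole-disk 'dd' copy mentioned above: a minimal sketch,
assuming GNU 'dd' and with made-up device names (check them very
carefully, since 'dd' will happily overwrite the wrong disk):

--------------------------------------------------------------
# /dev/sdX is the backup drive, /dev/sdY the damaged one: both
# names are placeholders, for illustration only.
dd if=/dev/sdX of=/dev/sdY bs=64M iflag=direct oflag=direct
--------------------------------------------------------------

The large 'bs=' keeps both drives streaming, and 'iflag=direct'
plus 'oflag=direct' give the O_DIRECT behaviour, bypassing the
page cache.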
Some numbers for clean filesystems:

--------------------------------------------------------------
# sysctl vm/drop_caches=3; time jfs_fsck -f /dev/sda8
vm.drop_caches = 3
jfs_fsck version 1.1.12, 24-Aug-2007
processing started: 4/30/2008 21.52.28
The current device is:  /dev/sda8
Block size in bytes:  4096
Filesystem size in blocks:  61046992
**Phase 0 - Replay Journal Log
**Phase 1 - Check Blocks, Files/Directories, and Directory Entries
**Phase 2 - Count links
**Phase 3 - Duplicate Block Rescan and Directory Connectedness
**Phase 4 - Report Problems
**Phase 5 - Check Connectivity
**Phase 6 - Perform Approved Corrections
**Phase 7 - Rebuild File/Directory Allocation Maps
**Phase 8 - Rebuild Disk Allocation Maps
244187968 kilobytes total disk space.
27013 kilobytes in 8242 directories.
173119730 kilobytes in 118301 user files.
10896 kilobytes in extended attributes
128391 kilobytes reserved for system use.
70955964 kilobytes are available for use.
Filesystem is clean.

real    1m2.159s
user    0m1.530s
sys     0m1.630s
--------------------------------------------------------------

That's on a 2004-class desktop machine, and it is doing almost
2,000 inodes/s. This instead is for a contemporary chunky server
with an 8-drive RAID10:

--------------------------------------------------------------
# sysctl vm/drop_caches=2; time jfs_fsck -f /dev/md0
vm.drop_caches = 2
jfs_fsck version 1.1.12, 24-Aug-2007
processing started: 4/30/2008 21.58.17
The current device is:  /dev/md0
Block size in bytes:  4096
Filesystem size in blocks:  390070208
**Phase 0 - Replay Journal Log
**Phase 1 - Check Blocks, Files/Directories, and Directory Entries
**Phase 2 - Count links
**Phase 3 - Duplicate Block Rescan and Directory Connectedness
**Phase 4 - Report Problems
**Phase 5 - Check Connectivity
**Phase 6 - Perform Approved Corrections
**Phase 7 - Rebuild File/Directory Allocation Maps
**Phase 8 - Rebuild Disk Allocation Maps
1560280832 kilobytes total disk space.
108545 kilobytes in 82098 directories.
17122632 kilobytes in 251457 user files.
100 kilobytes in extended attributes
496649 kilobytes reserved for system use.
1542769996 kilobytes are available for use.
Filesystem is clean.

real    0m55.271s
user    0m2.428s
sys     0m4.127s
--------------------------------------------------------------

That's a large but mostly empty filesystem, because I use it for
testing and the particular test involved a lot of very small files,
and yet it does 4-5,000 inodes/s. A million inodes? Probably 4-5
minutes. But that is not a deep scan over a messed-up filesystem.
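As a back-of-the-envelope check of that 4-5 minute guess (for the
clean-filesystem case only), using the 'user files' counts and the
'real' times from the two runs above:

--------------------------------------------------------------
# desktop: 118301 files in ~62s -> roughly 1,900 inodes/s
# server:  251457 files in ~55s -> roughly 4,500 inodes/s
# a million inodes at around 4,000 inodes/s:
echo $((1000000 / 4000))   # prints 250, i.e. a bit over 4 minutes
--------------------------------------------------------------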
