[ ... ]

>> This and "very busy storing/processing new files (24h/day)"
>> later seem to describe a fairly critical system with somewhat
>> high availability requirements.

[ ... ]

>> Indeed, and if one has availability constraints, relying on
>> 'fsck' being quick is equally unrealistic.

> That's an interesting comment - I guess I _have_ been relying
> on 1) the system only rarely needing a reboot and 2) a fast
> fsck when it happens.

Plenty of people do that, and then the bad news arrives. Some
time ago I was at a workshop on large-scale system administration
at big national research labs (CERN and so on), and I asked
almost every speaker what they were doing about filesystem
checking times; some seemed to be unaware of the issue.

The main driver of the issue is that RAID of various sorts makes
it easy to scale up capacity and read/write bandwidth, but 'fsck'
does not take advantage of the multiple spindles in a RAID set
because it runs largely serially.

> Do you have any insights to share wrt availability, large
> filesystems (up to 1Tb in our case) and millions of files?
> (apart from "don't do it" :-)

From other things you have written it looks like you use the
filesystem as a structured database. "don't do it" :-)

If you want to store many small records, it is usually better to
use a database manager than a filesystem.
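To illustrate the point (a minimal sketch; the file name and schema are made up for the example, not from anything in this thread), a single database file can hold a great many small records while costing the filesystem only one inode, so it never enters into 'fsck' inode-scan time at all:

```python
# Hedged sketch: many small records in one SQLite file instead of
# one filesystem inode per record. Names here are illustrative.
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records "
             "(name TEXT PRIMARY KEY, data BLOB)")
with conn:
    conn.executemany(
        "INSERT OR REPLACE INTO records VALUES (?, ?)",
        ((f"record-{i:06d}", b"payload") for i in range(1000)))

count, = conn.execute("SELECT count(*) FROM records").fetchone()
print(count)   # 1000 records, all in a single inode
conn.close()
```

Checking or repairing that one file is the database manager's problem (and tools like 'sqlite3' have their own integrity checks), not the filesystem checker's.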

However, filesystems can grow large in their own right without
remotely resembling structured databases: for example, a 200TB
repository with 100M files.

As to that, in general the only way I can see at the moment is to
use clusters of very many smaller filesystems, each of which can
be either repaired or restored from backup fairly quickly; in
practice that means 1-4TB and hundreds of thousands of inodes
each.
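Rough arithmetic behind that sizing (my own back-of-envelope, using the 200TB example above and the roughly 50MB/s single-drive copy rate mentioned later in this message; both figures are assumptions, not new measurements):

```python
# Back-of-envelope sizing sketch; rates taken from elsewhere in
# this message and assumed constant.
TB = 10**12

repo_bytes   = 200 * TB       # the 200TB / 100M-file example
fs_bytes     = 2 * TB         # one unit in the 1-4TB range
restore_rate = 50 * 10**6     # ~50MB/s sustained sequential copy

n_filesystems = repo_bytes // fs_bytes
restore_hours = fs_bytes / restore_rate / 3600

print(n_filesystems)            # 100 filesystems of 2TB each
print(round(restore_hours, 1))  # ~11.1 hours to restore one unit
```

So any single failed unit can be put back in well under a day, and independent units can be checked or restored in parallel, instead of one multi-week 'fsck' over the whole repository.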

Some notes I have written on the subject:

  http://www.sabi.co.uk/blog/0804apr.html#080417
  http://www.sabi.co.uk/blog/0804apr.html#080407

[ ... ]

>> The time taken to do a deep check of entangled filesystems can
>> be long. For an 'ext3' filesystem it was 75 days, and there are
>> other interesting reports of long 'fsck' times:

> Uh oh.  I guess I'd better move ahead with my new system, and
> hope to migrate whatever I can later on.

In your case, for up to 1TB, the best strategy is probably
frequent backups and then, in case of trouble, a quick restore
copying back the whole disk using 'dd'. With FW800 or eSATA I get
around 50MB/s sustained average (better with O_DIRECT and large
block sizes, which I now prefer) when duplicating modern cheap
500GB drives:

  http://www.sabi.co.uk/blog/0705may.html#070505
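The access pattern that makes this fast is simply large sequential blocks, which keeps both drives streaming. A minimal sketch of that pattern (plain buffered I/O here; for real disk-to-disk duplication one would just use 'dd', or open with O_DIRECT and aligned buffers as noted above; paths and block size are placeholders):

```python
# Sketch of a large-block whole-file/device copy, the pattern
# behind 'dd bs=...'. Plain buffered I/O for simplicity.
import os

def copy_large_blocks(src, dst, block_size=64 * 1024 * 1024):
    """Copy src to dst in block_size chunks; return bytes copied."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block)
            copied += len(block)
        fout.flush()
        os.fsync(fout.fileno())   # roughly dd's conv=fsync
    return copied
```

A block size in the tens of megabytes amortises per-request overhead; much smaller blocks leave sequential throughput on the table.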

Of course if you have a "warm" backup you can just swap in the
backup drive and, if really necessary, do an offline 'fsck' on
the swapped-out damaged filesystem. Several options are
advisable depending on circumstances.

>> but I haven't found (even on this mailing list) many reports of
>> 'fsck' durations for JFS, and my own filesystems are rather small
>> like yours (a few hundred thousand files, a few hundred GB of
>> data), and 'fsck' takes a few minutes on undamaged or mostly OK
>> filesystems.

> That has been my experience too - right up until 28 April at around
> 20:00. :-(

That's because even 'jfs_fsck -f' is quite quick on clean
filesystems; the problem is deep scans on messed-up filesystems.

Some numbers for clean filesystems:

  --------------------------------------------------------------
  # sysctl vm/drop_caches=3; time jfs_fsck -f /dev/sda8
  vm.drop_caches = 3
  jfs_fsck version 1.1.12, 24-Aug-2007
  processing started: 4/30/2008 21.52.28
  The current device is:  /dev/sda8
  Block size in bytes:  4096
  Filesystem size in blocks:  61046992
  **Phase 0 - Replay Journal Log
  **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
  **Phase 2 - Count links
  **Phase 3 - Duplicate Block Rescan and Directory Connectedness
  **Phase 4 - Report Problems
  **Phase 5 - Check Connectivity
  **Phase 6 - Perform Approved Corrections
  **Phase 7 - Rebuild File/Directory Allocation Maps
  **Phase 8 - Rebuild Disk Allocation Maps
  244187968 kilobytes total disk space.
      27013 kilobytes in 8242 directories.
  173119730 kilobytes in 118301 user files.
      10896 kilobytes in extended attributes
     128391 kilobytes reserved for system use.
   70955964 kilobytes are available for use.
  Filesystem is clean.

  real    1m2.159s
  user    0m1.530s
  sys     0m1.630s
  --------------------------------------------------------------

That's on a 2004-class desktop machine, and it is doing almost
2,000 inodes/s. I get similar results on a contemporary chunky
server with an 8-drive RAID10:

  --------------------------------------------------------------
  # sysctl vm/drop_caches=2; time jfs_fsck -f /dev/md0
  vm.drop_caches = 2
  jfs_fsck version 1.1.12, 24-Aug-2007
  processing started: 4/30/2008 21.58.17
  The current device is:  /dev/md0
  Block size in bytes:  4096
  Filesystem size in blocks:  390070208
  **Phase 0 - Replay Journal Log
  **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
  **Phase 2 - Count links
  **Phase 3 - Duplicate Block Rescan and Directory Connectedness
  **Phase 4 - Report Problems
  **Phase 5 - Check Connectivity
  **Phase 6 - Perform Approved Corrections
  **Phase 7 - Rebuild File/Directory Allocation Maps
  **Phase 8 - Rebuild Disk Allocation Maps
  1560280832 kilobytes total disk space.
     108545 kilobytes in 82098 directories.
   17122632 kilobytes in 251457 user files.
        100 kilobytes in extended attributes
     496649 kilobytes reserved for system use.
  1542769996 kilobytes are available for use.
  Filesystem is clean.

  real    0m55.271s
  user    0m2.428s
  sys     0m4.127s
  --------------------------------------------------------------

That's a large but mostly empty filesystem that I use for
testing, and the particular test involved a lot of very small
files; yet it does 4-5,000 inodes/s.

A million inodes? Probably 4-5 minutes. But that is a clean scan,
not a deep scan over a messed-up filesystem.
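That estimate follows directly from the measured clean-scan rates above (a sanity check only; it assumes the rate stays flat as inode counts grow, which a damaged filesystem would not honour):

```python
# Sanity check of the "million inodes in 4-5 minutes" guess,
# using the clean-scan rates measured above, assumed constant.
inodes = 1_000_000
fast_rate, slow_rate = 5000, 4000   # inodes/s, server RAID10 run

print(round(inodes / fast_rate / 60, 1))  # 3.3 minutes at 5,000/s
print(round(inodes / slow_rate / 60, 1))  # 4.2 minutes at 4,000/s
```

At the desktop machine's ~2,000 inodes/s the same million inodes would take over 8 minutes, so the spindle count and layout matter even for clean scans.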

_______________________________________________
Jfs-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jfs-discussion