RE: fsck.ext3 on large file systems?

2008-03-28 Thread Bly, MJ (Martin)
 -----Original Message-----
 From: Jon Peatfield [mailto:[EMAIL PROTECTED]]
 Sent: 28 March 2008 00:40
 To: Bly, MJ (Martin)
 Cc: [EMAIL PROTECTED]
 Subject: RE: fsck.ext3 on large file systems?
 
 ...
 At least one pass of the ext3 fsck involves checking every inode table
 entry, so '-T largefile4' would help you since you will get smaller
 numbers of inodes.  As the inode tables are spread through the disk it
 will read the various chunks, then seek off somewhere and read more...

We specify the number of inodes too - roughly one per MB.  I guess it
helps us in that most of our data is HEP data with large (of the order
of a GB) file sizes.
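
For illustration, a filesystem along those lines might be created like
this (device name and exact options are hypothetical, not the actual
build commands):

    # ext3 with roughly one inode per MB of space (-i is bytes-per-inode),
    # plus hashed directory indexes to speed up large directories:
    mke2fs -j -i 1048576 -O dir_index /dev/md0
    # Confirm the resulting inode count:
    dumpe2fs -h /dev/md0 | grep -i 'inode count'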

 [ one of the planned features for ext4 is a way to safely mark that an
 entire inode-table lump is unused, to save things like fsck from having
 to scan all those unused blocks.  Of course doing so safely isn't quite
 trivial, and it causes problems with the current model of how to choose
 the location of an inode for a new file... ]
 
  I'd be very suspicious of a HW RAID controller that took 'days' to
  fsck a file system unless the filesystem was already in serious
  trouble, and from bitter experience, fsck on a filesystem with holes
  in it caused by a bad raid controller interconnect (SCSI!) can do more
  damage than good.
 
 To give you one example: at least one fsck pass needs to check that
 every inode in use has the right (link-count) number of directory
 entries pointing at it.  The current ext3 fsck seems to do a good
 impersonation of a linear search through its in-memory
 inode-state-table for each directory entry - at least for files with
 non-trivial link-counts.
 
 A trivial analysis shows that such a set of checks would be O(n^2) in
 the number of files needing to be checked, not counting the performance
 problems when the 'in-memory' tables get too big for RAM...
 
 [ In case I'm slandering the ext3 fsck people - I've not actually
 checked that the ext3 fsck code really does anything as simple as a
 linear search, but anything more complex will need to use more memory
 and ... ]
 
 Last year we were trying to fsck a ~6.8TB ext3 fs which was about 70%
 filled with hundreds of hard-link trees of home directories.  So huge
 numbers of inode entries (many/most files are small), many with
 link-counts of, say, 150.  Our poor server had only 8G of RAM and the
 ext3 fsck wanted rather a lot more.  Obviously in such a case it will
 be *slow*.

Indeed, I recall some years ago needing to fsck some quite small
filesystems by modern standards and being put out by the time it took to
do it - down entirely to users with massively linked structures.  Of
course RAM counts were lower then, too.
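
As an aside, GNU find can show how much hard-linking a tree carries (an
illustrative one-liner; the path is hypothetical):

    # List files with more than one link: link count, inode, path.
    # Sorting numerically on the inode groups entries sharing storage.
    find /home -xdev -type f -links +1 -printf '%n %i %p\n' | sort -k2,2n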

 Of course that was after we built a version which didn't simply go into
 an infinite loop somewhere between 3 and 4TB into scanning through the
 inode-table.
 
 Now as you can guess any dump needs to do much the same kind of work
 wrt scanning the inodes, looking for hard-links, so you may not be
 shocked to discover that attempting a backup was rather slow too...

Been there, know what you mean.

 
 -- 
 Jon Peatfield,  Computer Officer,  DAMTP,  University of Cambridge
 Mail:  [EMAIL PROTECTED] Web:  http://www.damtp.cam.ac.uk/



Martin.
-- 
Martin Bly
RAL Tier1 Fabric Team 


Re: fsck.ext3 on large file systems?

2008-03-27 Thread Stephen John Smoogen
On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:
 Greetings.  We have lately had a lot of trouble with relatively large
  (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes.  The
  file systems in question are based on ext3.

  In a typical scenario, we have a drive go bad in a RAID array.  We then
  remove it from the array, if it isn't already, add a new hard drive
  (i.e., by hand, not from a hot spare), and add it back to the RAID
  array.  The RAID operations are all done using mdadm.

  After the RAID array has completed its rebuild, we run fsck on the RAID
  device.  When we do that, fsck seems to run forever, i.e., for days at a
  time, occasionally spitting out messages about files with recognizable
  names, but never completing satisfactorily.
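
(For reference, the replacement procedure described above corresponds to
mdadm commands roughly like the following; device names are
hypothetical.)

    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1  # drop the bad disk
    mdadm /dev/md0 --add /dev/sdc1                      # add the new drive
    cat /proc/mdstat                                    # watch the rebuild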


fsck of 1TB is going to take days due to the linear nature of its
checking of the disk. [ I think the disks for mirrors.kernel.org take
many weeks to fsck. ] The bigger question is what kind of data are you
writing to these disks, and is the ext3 journal large enough for those
writes?
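
One way to check the journal (illustrative; assumes the array device is
/dev/md0):

    # Journal fields appear in the superblock dump:
    dumpe2fs /dev/md0 | grep -i journal
    # A larger journal can be requested at mkfs time, e.g. 400 MB:
    mke2fs -j -J size=400 /dev/md0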


  The systems in question are typically running SL 4.x.  We've read that
  the version of fsck that is standard in SL 4 has some known bugs,
  especially wrt large file systems.

  Hence, we've attempted to repeat the exercise with fsck.ext3 taken from
  the Fedora 8 distribution.  This gives us improved, but still not
  satisfactory, results.


Did you recompile the binary from source, or did you use it straight?
I am just wondering if fsck is dependent on some kernel particulars...

To tell you the truth, I have not done anything with Linux RAID in the
terabyte range.  Usually I go with a hardware solution at that point
(usually for business reasons - that much storage usually comes with a
box with hardware RAID).  I did run into a similar issue, though, trying
to help someone last week on a SuSE box with ext3.  They also had a
long fsck and weird file names coming up.  I think they went with the
same solution (restore from backups).



-- 
Stephen J Smoogen. -- CSIRT/Linux System Administrator
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. The Merchant of Venice


Re: fsck.ext3 on large file systems?

2008-03-27 Thread Stephan Wiesand

On Thu, 27 Mar 2008, Stephen John Smoogen wrote:


On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:

Greetings.  We have lately had a lot of trouble with relatively large
 (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes.  The
 file systems in question are based on ext3.

 In a typical scenario, we have a drive go bad in a RAID array.  We then
 remove it from the array, if it isn't already, add a new hard drive
 (i.e., by hand, not from a hot spare), and add it back to the RAID
 array.  The RAID operations are all done using mdadm.

 After the RAID array has completed its rebuild, we run fsck on the RAID
 device.  When we do that, fsck seems to run forever, i.e., for days at a
 time, occasionally spitting out messages about files with recognizable
 names, but never completing satisfactorily.



fsck of 1TB is going to take days due to the linear nature of its
checking of the disk.


Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size frequently a
couple of years ago, under 2.4 (back then, it was SuSE 8.2 + a vanilla
kernel).  This took no more than a few hours (maybe 2, 3, or 4).  It was
hardware RAID, not too reliable (hence frequently), and not too fast
(< 100 MB/s).  A contemporary Linux server with software RAID should
complete an fsck *much* faster, or something is wrong.


And I still wonder why fsck at all just because a broken disk was
replaced in a redundant array?


--
Stephan Wiesand
  DESY - DV -
  Platanenallee 6
  15738 Zeuthen, Germany


Re: fsck.ext3 on large file systems?

2008-03-27 Thread Michael Hannon
On Thu, Mar 27, 2008 at 06:56:17AM -0600, Stephen John Smoogen wrote:
 On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:
  Greetings.  We have lately had a lot of trouble with relatively large
   (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes.  The
   file systems in question are based on ext3.
.
.
.
   After the RAID array has completed its rebuild, we run fsck on the RAID
   device.  When we do that, fsck seems to run forever, i.e., for days at a
   time, occasionally spitting out messages about files with recognizable
   names, but never completing satisfactorily.
 
 
 fsck of 1TB is going to take days due to the linear nature of its
 checking of the disk. [ I think the disks for mirrors.kernel.org take
 many weeks to fsck. ] The bigger question is what kind of data are you
 writing to these disks, and is the ext3 journal large enough for those
 writes?
 
 
   The systems in question are typically running SL 4.x.  We've read that
   the version of fsck that is standard in SL 4 has some known bugs,
   especially wrt large file systems.
 
   Hence, we've attempted to repeat the exercise with fsck.ext3 taken from
   the Fedora 8 distribution.  This gives us improved, but still not
   satisfactory, results.
 
 
 Did you recompile the binary from source, or did you use it straight?
 I am just wondering if fsck is dependent on some kernel particulars...

We just used the binary straight.  Thanks, Stephen.

- Mike
-- 
Michael Hannon            mailto:[EMAIL PROTECTED]
Dept. of Physics  530.752.4966
University of California  530.752.4717 FAX
Davis, CA 95616-8677


Re: fsck.ext3 on large file systems?

2008-03-27 Thread Michael Hannon
On Thu, Mar 27, 2008 at 05:06:53PM +0100, [EMAIL PROTECTED] wrote:
 On Thu, 27 Mar 2008, Stephen John Smoogen wrote:
.
.
.
 Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size frequently
 a couple of years ago, under 2.4 (back then, it was SuSE 8.2 + a
 vanilla kernel).  This took no more than a few hours (maybe 2, 3, or
 4).  It was hardware RAID, not too reliable (hence frequently), and not
 too fast (< 100 MB/s).  A contemporary Linux server with software RAID
 should complete an fsck *much* faster, or something is wrong.

Hi, Stephan.  Yea, I think I must be doing something wrong here, but I
haven't been able to figure out what it is.

 And I still wonder why fsck at all just because a broken disk was
 replaced in a redundant array?

The system seems to insist on it.  Again, there may be some cockpit
error involved.

Thanks.

- Mike
-- 
Michael Hannon            mailto:[EMAIL PROTECTED]
Dept. of Physics  530.752.4966
University of California  530.752.4717 FAX
Davis, CA 95616-8677


Re: fsck.ext3 on large file systems?

2008-03-27 Thread Stephen John Smoogen
On Thu, Mar 27, 2008 at 4:40 PM, Michael Hannon [EMAIL PROTECTED] wrote:
 On Thu, Mar 27, 2008 at 05:06:53PM +0100, [EMAIL PROTECTED] wrote:
   On Thu, 27 Mar 2008, Stephen John Smoogen wrote:
  .
  .

 .
   Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size
   frequently a couple of years ago, under 2.4 (back then, it was SuSE
   8.2 + a vanilla kernel).  This took no more than a few hours (maybe
   2, 3, or 4).  It was hardware RAID, not too reliable (hence
   frequently), and not too fast (< 100 MB/s).  A contemporary Linux
   server with software RAID should complete an fsck *much* faster, or
   something is wrong.

  Hi, Stephan.  Yea, I think I must be doing something wrong here, but I
  haven't been able to figure out what it is.


Well, this could be a hardware error on the wire (bad SCSI cable, etc.).
It could also depend on how the data is laid out on the disk.  The long
fscks were tons of directories and files - and they were read, deleted,
etc. in random order (e.g. INN news).  But then again, it could be that
I had crap hardware then too :).


   And I still wonder why fsck at all just because a broken disk was
   replaced in a redundant array?

  The system seems to insist on it.  Again, there may be some cockpit
  error involved.


Usually it will require an fsck if the disk did not shut down cleanly,
or there is some other issue it is detecting.  I would need to know more
about the hardware in place (controller, number of drives, type of
drives, how many spares, etc.) to make a more educated guess.
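
The superblock fields that force a boot-time check can be inspected, and
the purely periodic ones relaxed, with tune2fs (illustrative commands;
device name hypothetical):

    # Show mount count, maximum mount count, and check interval:
    tune2fs -l /dev/md0 | grep -Ei 'mount count|check'
    # Disable the every-N-mounts / every-M-days forced checks:
    tune2fs -c 0 -i 0 /dev/md0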



-- 
Stephen J Smoogen. -- CSIRT/Linux System Administrator
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. The Merchant of Venice


RE: fsck.ext3 on large file systems?

2008-03-27 Thread Jon Peatfield

On Thu, 27 Mar 2008, Bly, MJ (Martin) wrote:

snip

On our hardware RAID arrays (3ware, Areca, Infortrend) with many (12/14)
SATA disks, 500/750GB each, we fsck 2TB+ ext3 filesystems (as
infrequently as possible!) and it takes ~2 hours each.  We have some
5.5TB arrays that take less than three hours.  Note that these are
created with '-T largefile4 -O dir_index' among other options.


At least one pass of the ext3 fsck involves checking every inode table
entry, so '-T largefile4' would help you since you will get smaller
numbers of inodes.  As the inode tables are spread through the disk it
will read the various chunks, then seek off somewhere and read more...
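
That scatter is visible in the block-group listing (illustrative; device
name hypothetical):

    # Each block group holds its own slice of the inode table, so a full
    # inode scan seeks from group to group across the whole disk:
    dumpe2fs /dev/md0 | grep -A 3 '^Group'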


[ one of the planned features for ext4 is a way to safely mark that an
entire inode-table lump is unused, to save things like fsck from having
to scan all those unused blocks.  Of course doing so safely isn't quite
trivial, and it causes problems with the current model of how to choose
the location of an inode for a new file... ]



I'd be very suspicious of a HW RAID controller that took 'days' to fsck
a file system unless the filesystem was already in serious trouble, and
from bitter experience, fsck on a filesystem with holes in it caused by
a bad raid controller interconnect (SCSI!) can do more damage than good.


To give you one example: at least one fsck pass needs to check that
every inode in use has the right (link-count) number of directory
entries pointing at it.  The current ext3 fsck seems to do a good
impersonation of a linear search through its in-memory inode-state-table
for each directory entry - at least for files with non-trivial
link-counts.


A trivial analysis shows that such a set of checks would be O(n^2) in
the number of files needing to be checked, not counting the performance
problems when the 'in-memory' tables get too big for RAM...


[ In case I'm slandering the ext3 fsck people - I've not actually
checked that the ext3 fsck code really does anything as simple as a
linear search, but anything more complex will need to use more memory
and ... ]


Last year we were trying to fsck a ~6.8TB ext3 fs which was about 70%
filled with hundreds of hard-link trees of home directories.  So huge
numbers of inode entries (many/most files are small), many with
link-counts of, say, 150.  Our poor server had only 8G of RAM and the
ext3 fsck wanted rather a lot more.  Obviously in such a case it will be
*slow*.
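
[ A generic stopgap when e2fsck outgrows RAM - purely an illustration,
not what was done here - is temporary swap so its tables can spill to
disk, though paging them is itself painfully slow:

    dd if=/dev/zero of=/var/tmp/fsck.swap bs=1M count=16384  # 16GB file
    chmod 600 /var/tmp/fsck.swap
    mkswap /var/tmp/fsck.swap
    swapon /var/tmp/fsck.swap
]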


Of course that was after we built a version which didn't simply go into an 
infinite loop somewhere between 3 and 4TB into scanning through the 
inode-table.


Now as you can guess any dump needs to do much the same kind of work wrt 
scanning the inodes, looking for hard-links, so you may not be shocked to 
discover that attempting a backup was rather slow too...


--
Jon Peatfield,  Computer Officer,  DAMTP,  University of Cambridge
Mail:  [EMAIL PROTECTED] Web:  http://www.damtp.cam.ac.uk/


Re: fsck.ext3 on large file systems?

2008-03-26 Thread John Summerfield

Michael Hannon wrote:

Greetings.  We have lately had a lot of trouble with relatively large
(order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes.  The
file systems in question are based on ext3.

In a typical scenario, we have a drive go bad in a RAID array.  We then
remove it from the array, if it isn't already, add a new hard drive
(i.e., by hand, not from a hot spare), and add it back to the RAID
array.  The RAID operations are all done using mdadm.

After the RAID array has completed its rebuild, we run fsck on the RAID
device.  When we do that, fsck seems to run forever, i.e., for days at a
time, occasionally spitting out messages about files with recognizable
names, but never completing satisfactorily.


It would be very interesting to try to replicate the fsck on a single 
SATA drive. Several vendors have 1000 Gbyte drives for less than I 
recall paying for a Bigfoot.


If needs be, you could stripe two to get the size up, but a single drive 
eliminates the complexities of your current setup.
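
An illustrative way to build that test rig with mdadm (device names
hypothetical):

    # RAID 0 across two drives purely for capacity, then a test filesystem:
    mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mke2fs -j /dev/md1
    # ...populate with a comparable file mix, then time the check:
    time fsck.ext3 -f -C 0 /dev/md1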


I imagine if a desktop PC took days to fsck, someone would have noticed.




--

Cheers
John

-- spambait
[EMAIL PROTECTED]  [EMAIL PROTECTED]
-- Advice
http://webfoot.com/advice/email.top.php
http://www.catb.org/~esr/faqs/smart-questions.html
http://support.microsoft.com/kb/555375

You cannot reply off-list:-)