RE: fsck.ext3 on large file systems?
-----Original Message-----
From: Jon Peatfield [mailto:[EMAIL PROTECTED]]
Sent: 28 March 2008 00:40
To: Bly, MJ (Martin)
Cc: [EMAIL PROTECTED]
Subject: RE: fsck.ext3 on large file systems?

...

At least one pass of the ext3 fsck involves checking every inode table entry, so '-T largefile4' would help you, since you will get smaller numbers of inodes. As the inode tables are spread through the disk, it will read the various chunks, then seek off somewhere and read more...

We specify the number of inodes too - roughly one per MB. I guess it helps us in that most of our data is HEP data with large (of the order of GB) file sizes.

[ One of the planned features for ext4 is a way to safely mark that an entire inode-table lump is unused, to save things like fsck from having to scan all those unused blocks. Of course doing so safely isn't quite trivial, and it causes problems with the current model of how to choose the location for an inode for a new file... ]

I'd be very suspicious of a HW RAID controller that took 'days' to fsck a file system unless the filesystem was already in serious trouble, and from bitter experience, fsck on a filesystem with holes in it caused by a bad RAID controller interconnect (SCSI!) can do more damage than good.

To give you one example: at least one fsck pass needs to check that every inode in use has the right (link-count) number of directory entries pointing at it. The current ext3 fsck seems to do a good impersonation of a linear search through its in-memory inode-state table for each directory entry - at least for files with non-trivial link counts. A trivial analysis shows that such a set of checks would be O(n^2) in the number of files needing to be checked, not counting the performance problems when the in-memory tables get too big for RAM...

[ In case I'm slandering the ext3 fsck people - I've not actually checked that the ext3 fsck code really does anything as simple as a linear search, but anything more complex will need to use more memory and... ]

Last year we were trying to fsck a ~6.8TB ext3 fs which was about 70% filled with hundreds of hard-link trees of home directories. So: huge numbers of inode entries (many/most files are small), each with a link count of, say, 150. Our poor server had only 8G of RAM and the ext3 fsck wanted rather a lot more. Obviously in such a case it will be *slow*.

Indeed, I recall some years ago needing to fsck some filesystems quite small by modern standards and being put out by the time it took to do it - down entirely to users with massively linked structures. Of course RAM counts were lower then, too.

Of course that was after we built a version which didn't simply go into an infinite loop somewhere between 3 and 4TB into scanning through the inode table. Now, as you can guess, any dump needs to do much the same kind of work w.r.t. scanning the inodes, looking for hard links, so you may not be shocked to discover that attempting a backup was rather slow too...

Been there, know what you mean.

--
Jon Peatfield, Computer Officer, DAMTP, University of Cambridge
Mail: [EMAIL PROTECTED]  Web: http://www.damtp.cam.ac.uk/

Martin.
--
Martin Bly
RAL Tier1 Fabric Team
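For illustration, the '-T largefile4' plus roughly-one-inode-per-MB recipe discussed above corresponds to mke2fs options along these lines. This is a hedged sketch only; /dev/md0 is a placeholder device, not anything from the original posts:

```shell
# Reduce the number of inodes fsck must scan; /dev/md0 is a placeholder.
# -T largefile4 : mke2fs usage profile allocating one inode per 4 MiB
# -i 1048576    : alternatively, request roughly one inode per 1 MiB
# -O dir_index  : hashed b-tree directory lookups
mkfs.ext3 -T largefile4 -O dir_index /dev/md0

# Rough scale of the saving: at one inode per MiB, a 2 TiB filesystem
# carries about 2 * 1024 * 1024 inodes rather than the default many more.
echo $(( 2 * 1024 * 1024 ))
```

Fewer inodes directly shrinks the inode tables that the early fsck passes have to walk.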
Re: fsck.ext3 on large file systems?
On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:

Greetings. We have lately had a lot of trouble with relatively large (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes. The file systems in question are based on ext3. In a typical scenario, we have a drive go bad in a RAID array. We then remove it from the array, if it isn't already, add a new hard drive (i.e., by hand, not from a hot spare), and add it back to the RAID array. The RAID operations are all done using mdadm. After the RAID array has completed its rebuild, we run fsck on the RAID device. When we do that, fsck seems to run forever, i.e., for days at a time, occasionally spitting out messages about files with recognizable names, but never completing satisfactorily.

fsck of 1TB is going to take days due to the linear nature of it checking the disk. [ I think the disks for mirrors.kernel.org take many weeks to fsck. ] The bigger question is what kind of data you are writing to these disks, and is the ext3 journal large enough for those writes?

The systems in question are typically running SL 4.x. We've read that the version of fsck that is standard in SL 4 has some known bugs, especially w.r.t. large file systems. Hence, we've attempted to repeat the exercise with fsck.ext3 taken from the Fedora 8 distribution. This gives us improved, but still not satisfactory, results.

Did you recompile the binary from source, or did you use it straight? I am just wondering if fsck is dependent on some kernel particulars... To tell you the truth, I have not done anything with Linux RAID in the terabyte range; usually I go with a hardware solution at that point (usually for business reasons - that much storage usually comes with a box with hardware RAID). I did run into a similar issue, though, trying to help someone last week on a SuSE box with ext3. They also had a long fsck and weird file names coming up. I think they went with the same solution (restore from backups).
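For reference, the remove/re-add cycle described above maps onto mdadm roughly as follows. This is a generic sketch with placeholder device names, not the poster's actual configuration:

```shell
mdadm /dev/md0 --fail /dev/sdc1      # mark the bad member failed, if md hasn't already
mdadm /dev/md0 --remove /dev/sdc1    # detach it from the array
# ...swap the physical drive and partition the replacement to match...
mdadm /dev/md0 --add /dev/sdd1       # begin the rebuild onto the new disk
cat /proc/mdstat                     # watch resync progress
fsck.ext3 -f -C 0 /dev/md0           # forced check with a progress bar, once rebuilt

# Back-of-envelope rebuild time: a 1 TB (~1000000 MB) member resyncing
# at roughly 60 MB/s takes on the order of this many whole hours:
echo $(( 1000000 / 60 / 3600 ))
```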
--
Stephen J Smoogen. -- CSIRT/Linux System Administrator
How far that little candle throws his beams! So shines a good deed in a naughty world. -- Shakespeare, The Merchant of Venice
Re: fsck.ext3 on large file systems?
On Thu, 27 Mar 2008, Stephen John Smoogen wrote:

On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:

Greetings. We have lately had a lot of trouble with relatively large (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes. The file systems in question are based on ext3. In a typical scenario, we have a drive go bad in a RAID array. We then remove it from the array, if it isn't already, add a new hard drive (i.e., by hand, not from a hot spare), and add it back to the RAID array. The RAID operations are all done using mdadm. After the RAID array has completed its rebuild, we run fsck on the RAID device. When we do that, fsck seems to run forever, i.e., for days at a time, occasionally spitting out messages about files with recognizable names, but never completing satisfactorily.

fsck of 1TB is going to take days due to the linear nature of it checking the disk.

Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size frequently a couple of years ago, under 2.4 (back then, it was SuSE 8.2 + a vanilla kernel). This took no more than a few hours (maybe 2, 3, or 4). It was hardware RAID, not too reliable (hence "frequently"), and not too fast (< 100 MB/s). A contemporary Linux server with software RAID should complete an fsck *much* faster, or something is wrong.

And I still wonder why fsck at all just because a broken disk was replaced in a redundant array?

--
Stephan Wiesand
DESY - DV -
Platanenallee 6
15738 Zeuthen, Germany
Re: fsck.ext3 on large file systems?
On Thu, Mar 27, 2008 at 06:56:17AM -0600, Stephen John Smoogen wrote:

On Wed, Mar 26, 2008 at 7:34 PM, Michael Hannon [EMAIL PROTECTED] wrote:

Greetings. We have lately had a lot of trouble with relatively large (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes. The file systems in question are based on ext3.

. . .

After the RAID array has completed its rebuild, we run fsck on the RAID device. When we do that, fsck seems to run forever, i.e., for days at a time, occasionally spitting out messages about files with recognizable names, but never completing satisfactorily.

fsck of 1TB is going to take days due to the linear nature of it checking the disk. [ I think the disks for mirrors.kernel.org take many weeks to fsck. ] The bigger question is what kind of data you are writing to these disks, and is the ext3 journal large enough for those writes?

The systems in question are typically running SL 4.x. We've read that the version of fsck that is standard in SL 4 has some known bugs, especially w.r.t. large file systems. Hence, we've attempted to repeat the exercise with fsck.ext3 taken from the Fedora 8 distribution. This gives us improved, but still not satisfactory, results.

Did you recompile the binary from source, or did you use it straight? I am just wondering if fsck is dependent on some kernel particulars...

We just used the binary straight. Thanks, Stephen.

- Mike

--
Michael Hannon  mailto:[EMAIL PROTECTED]
Dept. of Physics  530.752.4966
University of California  530.752.4717 FAX
Davis, CA 95616-8677
Re: fsck.ext3 on large file systems?
On Thu, Mar 27, 2008 at 05:06:53PM +0100, [EMAIL PROTECTED] wrote:

On Thu, 27 Mar 2008, Stephen John Smoogen wrote:

. . .

Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size frequently a couple of years ago, under 2.4 (back then, it was SuSE 8.2 + a vanilla kernel). This took no more than a few hours (maybe 2, 3, or 4). It was hardware RAID, not too reliable (hence "frequently"), and not too fast (< 100 MB/s). A contemporary Linux server with software RAID should complete an fsck *much* faster, or something is wrong.

Hi, Stephan. Yeah, I think I must be doing something wrong here, but I haven't been able to figure out what it is.

And I still wonder why fsck at all just because a broken disk was replaced in a redundant array?

The system seems to insist on it. Again, there may be some cockpit error involved. Thanks.

- Mike

--
Michael Hannon  mailto:[EMAIL PROTECTED]
Dept. of Physics  530.752.4966
University of California  530.752.4717 FAX
Davis, CA 95616-8677
Re: fsck.ext3 on large file systems?
On Thu, Mar 27, 2008 at 4:40 PM, Michael Hannon [EMAIL PROTECTED] wrote:

On Thu, Mar 27, 2008 at 05:06:53PM +0100, [EMAIL PROTECTED] wrote:

On Thu, 27 Mar 2008, Stephen John Smoogen wrote:

. . .

Hmm, we successfully fsck'd ext3 filesystems 1.4 TB in size frequently a couple of years ago, under 2.4 (back then, it was SuSE 8.2 + a vanilla kernel). This took no more than a few hours (maybe 2, 3, or 4). It was hardware RAID, not too reliable (hence "frequently"), and not too fast (< 100 MB/s). A contemporary Linux server with software RAID should complete an fsck *much* faster, or something is wrong.

Hi, Stephan. Yeah, I think I must be doing something wrong here, but I haven't been able to figure out what it is.

Well, this could be a hardware error on the wire (bad SCSI cable, etc.). It could also depend on how the data is laid out on the disk. The long fscks I saw involved tons of directories and files that were read, deleted, etc. in random order (e.g. INN news). But then again, it could be that I had crap hardware then too :).

And I still wonder why fsck at all just because a broken disk was replaced in a redundant array?

The system seems to insist on it. Again, there may be some cockpit error involved.

Usually it will require an fsck if the disk did not shut down cleanly or if there is some other issue it is detecting. I would need to know more about the hardware in place (controller, number of drives, type of drives, how many spares, etc.) to make a more educated guess.

--
Stephen J Smoogen. -- CSIRT/Linux System Administrator
How far that little candle throws his beams! So shines a good deed in a naughty world. -- Shakespeare, The Merchant of Venice
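On the "system seems to insist on it" point: besides the dirty/error flags, ext3 also forces a check after a maximum mount count or time interval, and tune2fs can show and adjust both. A sketch, demonstrated on a small scratch image file so no real device is touched (the same commands apply to an array device):

```shell
# Build a tiny throwaway ext3 image to demonstrate on:
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=16 2>/dev/null
mkfs.ext3 -q -F /tmp/scratch.img

# See whether mount-count or interval-based checking is what's firing:
tune2fs -l /tmp/scratch.img | grep -Ei 'mount count|check'

# Disable the periodic triggers (checks still happen when the fs is
# marked dirty or has errors); schedule your own fscks instead:
tune2fs -c 0 -i 0 /tmp/scratch.img
```

After the last command, 'Maximum mount count' reads -1 (disabled), so a clean reboot after an md rebuild no longer forces a multi-hour check.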
RE: fsck.ext3 on large file systems?
On Thu, 27 Mar 2008, Bly, MJ (Martin) wrote:

[snip]

On our hardware RAID arrays (3ware, Areca, Infortrend) with many (12/14) SATA disks, 500/750GB each, we fsck 2TB+ ext3 filesystems (as infrequently as possible!) and it takes ~2 hours each. We have some 5.5TB arrays that take less than three hours. Note that these are created with '-T largefile4 -O dir_index' among other options.

At least one pass of the ext3 fsck involves checking every inode table entry, so '-T largefile4' would help you, since you will get smaller numbers of inodes. As the inode tables are spread through the disk, it will read the various chunks, then seek off somewhere and read more...

[ One of the planned features for ext4 is a way to safely mark that an entire inode-table lump is unused, to save things like fsck from having to scan all those unused blocks. Of course doing so safely isn't quite trivial, and it causes problems with the current model of how to choose the location for an inode for a new file... ]

I'd be very suspicious of a HW RAID controller that took 'days' to fsck a file system unless the filesystem was already in serious trouble, and from bitter experience, fsck on a filesystem with holes in it caused by a bad RAID controller interconnect (SCSI!) can do more damage than good.

To give you one example: at least one fsck pass needs to check that every inode in use has the right (link-count) number of directory entries pointing at it. The current ext3 fsck seems to do a good impersonation of a linear search through its in-memory inode-state table for each directory entry - at least for files with non-trivial link counts. A trivial analysis shows that such a set of checks would be O(n^2) in the number of files needing to be checked, not counting the performance problems when the in-memory tables get too big for RAM...
[ In case I'm slandering the ext3 fsck people - I've not actually checked that the ext3 fsck code really does anything as simple as a linear search, but anything more complex will need to use more memory and... ]

Last year we were trying to fsck a ~6.8TB ext3 fs which was about 70% filled with hundreds of hard-link trees of home directories. So: huge numbers of inode entries (many/most files are small), each with a link count of, say, 150. Our poor server had only 8G of RAM and the ext3 fsck wanted rather a lot more. Obviously in such a case it will be *slow*.

Of course that was after we built a version which didn't simply go into an infinite loop somewhere between 3 and 4TB into scanning through the inode table. Now, as you can guess, any dump needs to do much the same kind of work w.r.t. scanning the inodes, looking for hard links, so you may not be shocked to discover that attempting a backup was rather slow too...

--
Jon Peatfield, Computer Officer, DAMTP, University of Cambridge
Mail: [EMAIL PROTECTED]  Web: http://www.damtp.cam.ac.uk/
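One mitigation for the memory blow-up Jon describes: e2fsprogs releases newer than the stock SL4 ones let e2fsck spill its in-memory tracking tables to disk via /etc/e2fsck.conf. A sketch, assuming a sufficiently recent e2fsprogs; the cache directory path here is arbitrary:

```ini
[scratch_files]
directory = /var/cache/e2fsck
```

With this set, e2fsck trades speed for memory by keeping its inode-tracking structures in files under that directory, which can be the difference between finishing and thrashing on an 8G box.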
Re: fsck.ext3 on large file systems?
Michael Hannon wrote:

Greetings. We have lately had a lot of trouble with relatively large (order of 1TB) file systems mounted on RAID 5 or RAID 6 volumes. The file systems in question are based on ext3. In a typical scenario, we have a drive go bad in a RAID array. We then remove it from the array, if it isn't already, add a new hard drive (i.e., by hand, not from a hot spare), and add it back to the RAID array. The RAID operations are all done using mdadm. After the RAID array has completed its rebuild, we run fsck on the RAID device. When we do that, fsck seems to run forever, i.e., for days at a time, occasionally spitting out messages about files with recognizable names, but never completing satisfactorily.

It would be very interesting to try to replicate the fsck on a single SATA drive. Several vendors have 1000 GByte drives for less than I recall paying for a Bigfoot. If need be, you could stripe two to get the size up, but a single drive eliminates the complexities of your current setup. I imagine if a desktop PC took days to fsck, someone would have noticed.

--
Cheers
John

-- spambait
[EMAIL PROTECTED]
[EMAIL PROTECTED]

-- Advice
http://webfoot.com/advice/email.top.php
http://www.catb.org/~esr/faqs/smart-questions.html
http://support.microsoft.com/kb/555375

You cannot reply off-list :-)
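John's single-drive experiment might look like the following. A sketch only: /dev/sdb1 and /mnt/test are placeholders, and the mkfs step destroys whatever is on that partition:

```shell
# Build a comparable ext3 fs on one plain SATA drive (destroys its contents):
mkfs.ext3 /dev/sdb1
mount /dev/sdb1 /mnt/test
# ...copy a representative slice of the problem data onto it...
umount /mnt/test

# Time a forced, read-only check; -n answers 'no' to all prompts, so the
# run is non-destructive and comparable across machines:
time fsck.ext3 -f -n /dev/sdb1
```

If the single drive checks in hours while the array takes days, that points at the RAID layer (or its hardware) rather than at fsck itself.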