Re: horribly slow fsck_ffs pass1 performance
On Sat, Apr 02, 2011 at 01:45:36PM -0500, Amit Kulkarni wrote:

> Hi, I am replying in a single email. I do a fsck once in a while, not regularly. In the last 6-8 months I might have done it about 5 times. And I did it multi-user the few times I did it, but plan on doing it single user in future, and I do plan to do it monthly. After seeing the messages when you fsck, it is better to do it monthly. FreeBSD, which is the origin of FFS, does a background fsck, and if Kirk McKusick feels so strongly I will do it too. (I remember somebody talking about having background fsck here on an OpenBSD list, but I forgot who it was.)

This is completely stupid. What do you trust more? Your file system, or fsck? Both have bugs! I'm sure of it! So, if you run fsck, it's likely you're going to run into fsck bugs eventually (and trying fsck on a mounted partition was really, really stupid). Whereas, if you don't run fsck, you're going to run into fs bugs eventually.

Now, consider this: the fs code is very heavily tested. People use it 24 hours a day, 365 days a year. Compared to THAT, the fsck code is very lightly tested. It's run only once in a while, when the power shuts down, or when you update your machines. What is more likely? Corrupting a perfectly sane filesystem by running fsck on it (which has MORE code paths to correct problems and is usually run on corrupted filesystems), or having an unseen bug in the fs code that affects only you and that fsck would be able to see?
Re: horribly slow fsck_ffs pass1 performance
> Now, consider this: the fs code is very heavily tested. People use it 24 hours a day, 365 days a year.

Except on leap years, of course. Those years see even more real-life testing happening!
Re: horribly slow fsck_ffs pass1 performance
On Sun, Apr 10, 2011 at 11:27:41AM +, Miod Vallat wrote:

> > Now, consider this: the fs code is very heavily tested. People use it 24 hours a day, 365 days a year.
>
> Except on leap years, of course. Those years see even more real-life testing happening!

Good point. Maybe we should go to single user and run fsck in a loop on February 29th.
Re: horribly slow fsck_ffs pass1 performance
On Sun, 10 Apr 2011, Marc Espie wrote:

> This is completely stupid. What do you trust more? Your file system, or fsck? Both have bugs! I'm sure of it! So, if you run fsck, it's likely you're going to run into fsck bugs eventually (and trying fsck on a mounted partition was really, really stupid).

There is an optional fsck -n of (mostly) mounted filesystems in /etc/daily. In what cases is this automated check intended to be used?

Regards, David
Re: horribly slow fsck_ffs pass1 performance
On Sun, Apr 10, 2011 at 02:40:09PM +0200, David Vasek wrote:

> There is an optional fsck -n of (mostly) mounted filesystems in /etc/daily. In what cases is this automated check intended to be used?

It's code we got from BSD 4.4 lite2. This specific chunk was brought in by millert@ in rev 1.17. I would venture that apart from file system hackers, or people with really flaky hardware, no one should set that specific test...
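For anyone curious how that optional daily check gets switched on: the sketch below shows one plausible way. The CHECKFILESYSTEMS variable name and the crontab mechanism are given from memory of daily(8) and should be verified against the manual page on your release before use.

```shell
# root's crontab (edit with "crontab -e" as root) -- hypothetical
# sketch; confirm the variable name against daily(8) on your system.
# When CHECKFILESYSTEMS is set in the environment of /etc/daily, the
# script runs a read-only "fsck -n" over the filesystems in /etc/fstab
# and mails the output to root.  The -n flag never writes to disk,
# which is the only reason running it on mounted filesystems is
# tolerable at all.
30 1 * * * CHECKFILESYSTEMS=1 /bin/sh /etc/daily
```

Since -n only reports and never repairs, this sidesteps the "fsck on a mounted partition" danger discussed above, at the cost of occasionally reporting transient inconsistencies on busy softdep filesystems.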
Re: horribly slow fsck_ffs pass1 performance
On Tue, Apr 5, 2011 at 7:06 AM, Janne Johansson icepic...@gmail.com wrote:

> > /forcefsck and /fastboot have nothing to do with that; they are not even administered by the fs.
>
> I wasn't trying to imply the filesystem is putting the files there, nor reading them. Rather, those two files show that since there is no way to mark known brokenness in an ext file system, we wrap it up in shell scripts that create and look for those files in order to 'know' if the filesystems are broken or not and if fsck is in order.

Sorry, but those files aren't for that purpose. They're just a means of queuing fscks and were never intended as a viable replacement for dirty flags. Conversely, OpenBSD has got /fastboot, yet aptly omits /forcefsck. You could say the latter has been avoided because it's a chicken-and-egg problem, but keep in mind that it forces fsck on all filesystems with a fs_passno greater than zero, not just root, and that / is very unlikely to become corrupted because it's typically split off and seldom written to.
Re: horribly slow fsck_ffs pass1 performance
On Saturday, April 2, Amit Kulkarni wrote:

> FreeBSD which is the origin of FFS does a background fsck, and if Kirk McKusick feels so strongly I will do it too.

FreeBSD was not the origin of the FFS code. Background fsck in FreeBSD is mainly meant to reduce the amount of time it takes to get to a usable system at boot (after an unclean shutdown).

-Toby.
Re: horribly slow fsck_ffs pass1 performance
2011/4/2 Benny Lofgren bl-li...@lofgren.biz:

> I've noticed that some (all?) linux systems do uncalled-for file system checks at boot if no check has been made recently, but I've never understood this practice. It must mean they don't trust their own file systems,

I'm quite sure this comes from the fact that there are several ways for an ext file system to get errors (which in bash used to show up as "input/output error" when you try to reference the file), but the filesystem will not store the error condition anywhere. So if you make a clean shutdown and reboot, fsck will not know that a check is due, and skip over it, and for that whole session until the next reboot the file is still as inaccessible as before. And since only root may write the magic file in the (broken) filesystem root, a normal user can not force the fsck either, unless he kills the power switch so the boot scripts know there was an unclean shutdown before, OR reboots 147 times (or whatever the interval may be) so the system does run the fsck at boot.

I don't pretend to know the optimal solution for keeping track of "hey, I just told the user his file is corrupt, I should ask for fsck on the next mount", but even the early-80s Amiga floppy file systems had a global dirty flag, so the OS would launch the disk validator the next time you inserted the disk and mounted the filesystem if it had hit some kind of read/write error. Letting users run 1-146 reboot cycles without checking, even when you know stuff is broken, is horrid. And having a file inside the actual filesystem to indicate "if this file isn't deleted it means something" as an inverse flag really doesn't count (/fastboot or whatever), since if half your files disappear and that one went also, then its missing status would indicate everything is fine.

--
To our sweethearts and wives. May they never meet.
    -- 19th century toast
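For the record, the "147 reboots" behaviour described above is driven by counters stored in the ext superblock itself, not by /forcefsck. A quick sketch with e2fsprogs (the device path is just an example):

```shell
# Inspect the ext2/3/4 superblock fields that drive the periodic
# boot-time check (substitute your own ext partition for /dev/sda1):
tune2fs -l /dev/sda1 | grep -Ei 'mount count|check interval|last checked'

# "Maximum mount count" is the N in "force fsck every N mounts".
# Both the count-based and the time-based triggers can be disabled:
tune2fs -c 0 -i 0 /dev/sda1
```

This is why a clean shutdown doesn't clear the pending check: the mount count keeps incrementing regardless of whether errors were ever seen, which is exactly the complaint made in the thread.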
Re: horribly slow fsck_ffs pass1 performance
On Sun, Apr 3, 2011 at 4:21 AM, Janne Johansson icepic...@gmail.com wrote:

> [...] And having a file inside the actual filesystem to indicate "if this file isn't deleted it means something" as an inverse flag really doesn't count (/fastboot or whatever), since if half your files disappear and that one went also, then its missing status would indicate everything is fine.

/forcefsck and /fastboot have nothing to do with that; they are not even administered by the fs.

--
To our sweethearts and wives. May they never meet.
    -- 19th century toast
Re: horribly slow fsck_ffs pass1 performance
On 2011-04-01 21.48, Amit Kulkarni wrote:

> > And jumping up and down after a first successful test is not a sound engineering principle either.

[...stuff deleted...]

> It turns out that I had extracted into the default firefox download location (/home/amit/downloads, I forget exactly where) all kinds of files. There were sources for gdb 6.3, 6.6, 6.7, 6.8; GCC 3.3, 3.4.6, 4.5 etc.; LLVM + Clang 2.8. Still more that I forget. This is a 20 GB fs and I was totally unaware I was abusing my fs so much. The day this happened, I had updated src, ports, xenocara, www from cvs. I immediately did a plain fsck right after this operation. I typed fsck in the same window while it was updating ports. In hindsight, I might have waited till it had finished writing the cache to disk. The fsck proceeded well until it encountered the gazillion files in /home.

I have to ask: from your description I get the impression that you're fsck'ing mounted file systems, and you seem to be doing this on a more or less regular basis? Why?

Regards,
/Benny

--
internetlabbet.se   /   work: +46 8 551 124 80   /  Words must
Benny Löfgren       / mobile: +46 70 718 11 90   /  be weighed,
                    /    fax: +46 8 551 124 89   /  not counted.
                    /  email: benny -at- internetlabbet.se
Re: horribly slow fsck_ffs pass1 performance
On 2011-04-01 19.03, Amit Kulkarni wrote:

> Thank you Arthur and the team for a very fast turnaround! Thank you for reducing the pain. I will schedule a fsck every month or so, knowing it won't screw up anything and be done really quick.

Why schedule fsck runs at all? The file system code is very mature and although of course it would be unwise to declare it bug free, I see very little reason to run fsck on a file system unless there has been some problem like an unclean shutdown to prompt it (in which case, of course, the system does it for you automatically when rebooting).

I've noticed that some (all?) linux systems do uncalled-for file system checks at boot if no check has been made recently, but I've never understood this practice. It must mean they don't trust their own file systems, which frankly I find a bit unsettling... I'd rather use a file system that's been field proven for decades than use something that's just come out of the experimenting shop.

Regards,
/Benny
Re: horribly slow fsck_ffs pass1 performance
Hi, I am replying in a single email.

I do a fsck once in a while, not regularly. In the last 6-8 months I might have done it about 5 times. And I did it multi-user the few times I did it, but plan on doing it single user in future, and I do plan to do it monthly. After seeing the messages when you fsck, it is better to do it monthly. FreeBSD, which is the origin of FFS, does a background fsck, and if Kirk McKusick feels so strongly I will do it too. (I remember somebody talking about having background fsck here on an OpenBSD list, but I forgot who it was.)

FS code in OpenBSD is mature and appears to be better than on FreeBSD. Linux has a problem with fsync() on ext3 (maybe even ext4); that is why they do it so often. I read that they go for more speed and pay less attention to data integrity.

I have been on OpenBSD for only about 6-8 months, so I will try it out. I don't have anything important on that OpenBSD machine; everything is backed up safely. Once I am fully satisfied I won't do it monthly, maybe less or most likely never. I will be experimenting with fsck, with that new code change by Otto, at least for the next few months.

You guys know the limits and capabilities. So *you* don't, some others might or might not. But I am learning and wanting to be on a stable, virus-free, trojan-free, crapware-free machine. The choice for me is one of the BSDs. What is a new guy to know?

Thanks, amit

On Sat, Apr 2, 2011 at 10:46 AM, Benny Lofgren bl-li...@lofgren.biz wrote:

> On 2011-04-01 19.03, Amit Kulkarni wrote:
> > Thank you Arthur and the team for a very fast turnaround! Thank you for reducing the pain. I will schedule a fsck every month or so, knowing it won't screw up anything and be done really quick.
>
> Why schedule fsck runs at all? The file system code is very mature and although of course it would be unwise to declare it bug free, I see very little reason to run fsck on a file system unless there has been some problem like an unclean shutdown to prompt it (in which case, of course, the system does it for you automatically when rebooting). [...]
Re: horribly slow fsck_ffs pass1 performance
On 2011/04/02 13:45, Amit Kulkarni wrote:

> I do a fsck once in a while, not regularly. In the last 6-8 months I
> might have done it about 5 times. And I did it multi-user the few times
> I did it, but plan on doing it single user in future, and I do plan to
> do it monthly. After seeing the messages when you fsck, it is better to
> do it monthly. FreeBSD which is the origin of FFS does a
  ^^ hmm?
> background fsck, and if Kirk McKusick feels so strongly I will do it
> too. (I remember somebody talking about having background fsck here on
> an OpenBSD list, but I forgot who it was).

The background fsck there isn't done every time, just after an unclean shutdown (and a right pain it was too, the last time I experienced it; fsck can use a lot of RAM...).
Re: horribly slow fsck_ffs pass1 performance
On 31 Mar 2011, at 22:25, Otto Moerbeek o...@drijf.net wrote:

> On Thu, Mar 31, 2011 at 10:14:46PM +0200, Otto Moerbeek wrote:
> > So here's an initial, only lightly tested diff. Beware, this very well could eat your filesystems. To note any difference, you should use the -p mode of fsck_ffs (rc does that) and the fs should have been mounted with softdep.
>
> I now realize speedup will also be there for non -p usage in quite a few cases.

So that explains the speed differences seen without -p. But the reported original 4 hours likely means the system is swapping. I have seen very nice speedups already. But don't count yourself a rich man too soon: for FFS2 filesystems, you won't see a lot of speedup, because inode blocks are allocated on demand there, so a filesystem with few inodes used likely has few inode blocks. Also, depending on the usage patterns, you might have a fs where high-numbered inodes are used while the fs itself is pretty empty. Filling up a fs with lots of files and then removing a lot of them is an example that could lead to such a situation. This diff does not speed things up in such cases.

-Otto
Re: horribly slow fsck_ffs pass1 performance
> Hi Otto,
>
> fsck -p is not possible to do in multi-user because of
>
>   # fsck -p /extra
>   NO WRITE ACCESS
>   /dev/rwd0m: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
>
> I haven't checked, but it probably wants to do it single user when all fs are unmounted. And it would work when fs had an unclean shutdown. I applied art@'s diff, and the exact same partition (which I newfs'd with the original defaults -b 16K -f 2K) went through fsck within 1 minute (I copied the original /sbin/fsck_ffs to /sbin/ofsck_ffs). I have enabled bigmem, and his diff is absolutely needed for fast fsck. Thank you Arthur and the team for a very fast turnaround! Thank you for reducing the pain. I will schedule a fsck every month or so, knowing it won't screw up anything and be done really quick. So with the information presented in this thread, fsck shouldn't be a problem for anybody anymore. That is, increasing data block size (say for Postgres, or for virtual images, or family videos) and checking only used inodes.
>
> Thanks, amit

Please tell us more why -p does not work. What happens if you try it? Be more exact in your description. So why is your system extra slow? Maybe it has too little memory and starts swapping. Some details might come in handy.
Re: horribly slow fsck_ffs pass1 performance
On Fri, Apr 01, 2011 at 12:03:19PM -0500, Amit Kulkarni wrote:

> Hi Otto, fsck -p is not possible to do in multi-user because of
>
>   # fsck -p /extra
>   NO WRITE ACCESS
>   /dev/rwd0m: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.

Of course. What's the point of checking a mounted filesystem?

> I haven't checked, but it probably wants to do it single user when all fs are unmounted. [...] So with the information presented in this thread, fsck shouldn't be a problem for anybody anymore. That is, increasing data block size (say for Postgres, or for virtual images, or family videos) and checking only used inodes.
>
> Thanks, amit

I should say thanks for pointing at the optimization. What I don't like is that you never have given details (even when requested) on your extremely slow original fsck which started this thread. The last couple of years I tested fsck on many different setups, but I never saw fsck times of 4 hours and not even finished. So there's something special about your setup. It's likely that bigmem plays a role, but you only mention it now. That's not the way to do proper problem analysis. And jumping up and down after a first successful test is not a sound engineering principle either.

-Otto
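For readers following along: the usual way to check a non-root filesystem without dropping to single user is simply to take it offline first. A minimal sketch (the partition name follows the /extra example used in this thread):

```shell
# Check a non-root filesystem from multi-user mode: unmount it so fsck
# gets exclusive, writable access to the raw device, run the preen-mode
# check, then remount.
umount /extra
fsck -p /extra
mount /extra
```

The "NO WRITE ACCESS" refusal above comes precisely from fsck declining to repair a device that is still mounted read-write; only the root filesystem, which cannot be unmounted, needs the single-user route.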
Re: horribly slow fsck_ffs pass1 performance
> What I don't like is that you never have given details (even when requested) on your extremely slow original fsck which started this thread. The last couple of years I tested fsck on many different setups, but I never saw fsck times of 4 hours and not even finished. So there's something special about your setup. It's likely that bigmem plays a role, but you only mention it now. That's not the way to do proper problem analysis. And jumping up and down after a first successful test is not a sound engineering principle either.
>
> -Otto

Otto, I am sorry that I overlooked giving any details. Here goes.

My fs layout is as follows. Some fs are now a mix of newfs with -b 64K -f 8K and some with -b 16K -f 2K (I changed /tmp, /usr/obj, /usr/xobj and /personal over to the bumped-up newfs values). I will change them all back to the default values over this weekend; I prefer defaults.

# df -h
Filesystem     Size    Used   Avail  Capacity  Mounted on
/dev/wd0a     1005M   73.4M    881M     8%     /
/dev/wd0d      4.0G   40.0K    3.8G     0%     /tmp
/dev/wd0e      9.3G   16.9M    8.9G     0%     /var
/dev/wd0f      5.9G    1.6G    4.0G    29%     /usr
/dev/wd0g     1008M    169M    789M    18%     /usr/X11R6
/dev/wd0h     11.8G    1.3G    9.9G    12%     /usr/local
/dev/wd0i      5.9G    770M    4.9G    13%     /usr/src
/dev/wd0j      4.0G    1.5M    3.8G     0%     /usr/obj
/dev/wd0k      4.0G    8.0K    3.8G     0%     /usr/xobj
/dev/wd0l     19.7G    110M   18.6G     1%     /home
/dev/wd0m     39.4G    1.3G   36.1G     4%     /extra
/dev/wd0n     39.8G    9.2G   28.6G    24%     /personal
/dev/wd0o     81.1G   94.0M   76.9G     0%     /downloads
/dev/sd0a      231G   13.3G    206G     6%     /datamir

The original problem I faced was here: http://marc.info/?l=openbsd-misc&m=129900971428196&w=2

I had turned on bigmem slightly just before this debacle happened. It turns out that I had extracted into the default firefox download location (/home/amit/downloads, I forget exactly where) all kinds of files. There were sources for gdb 6.3, 6.6, 6.7, 6.8; GCC 3.3, 3.4.6, 4.5 etc.; LLVM + Clang 2.8. Still more that I forget. This is a 20 GB fs and I was totally unaware I was abusing my fs so much. The day this happened, I had updated src, ports, xenocara, www from cvs. I immediately did a plain fsck right after this operation. I typed fsck in the same window while it was updating ports. In hindsight, I might have waited till it had finished writing the cache to disk.

The fsck proceeded well until it encountered the gazillion files in /home. Being naive, I expected it to complete in 1 hr at the most. Here I am staring at the screen, and the machine is completely unresponsive. A keypress is taking a long time. I found out during this unfortunate time that the OpenBSD kernel is not pre-emptive; I/O goes on uninterrupted. I did a pkill, kill -9, to no avail. I tried logging into virtual terminal 2, before giving up after 4-5 hours. I was reading the FAQ and googling, which said fsck needs more memory. I didn't think it would apply to my case. I didn't even think that bigmem had anything to do with fsck. This machine is 8GB RAM with 2 x dual-core Opterons. So there is no memory issue... The machine went into heavy I/O load, I could tell that much. Hard disk spinning like crazy. (Btw, now I know how to do a cleaner shutdown while hitting the power button on OpenBSD.)

So the next day or so, I went into single user and marked all fs clean except /home. After some more time in fsck, it struck me that I might have lots of files, so again a power cycle, and then marking /home clean and then rm -rf. I learnt so many things here due to this experience... such as you have to let softupdates settle for 30 seconds after heavy I/O before it flushes its cache to disk. Anyway, after the rm -rf in /home, I experimented with fsck'ing the / fs; it was quick. Then I experimented with fsck of /usr/xobj, /usr/obj, basically in increasing order of fs use, to find where fsck was hanging. It still was taking time; it was reasonable till 2-4G, then it just went crazy. For kicks I did a fsck on a huge unused partition of 160 GB and it also was taking time. I made sure there was nothing there in that huge 160GB partition, did a rm -rf.

I trimmed /home to what you see now, and shifted my downloads from Chrome/Firefox to /downloads. The huge partition of 160 GB which was unused, just as recommended in the FAQ, I broke up into three: /downloads, /personal, /extra. I removed XFCE and GNOME to cut the fat and went back to the default FVWM. And from then on I made the statement that above the 12GB /usr/local it was crazy to contemplate doing fsck, because it just wouldn't proceed. This story is from memory so it will be inconsistent. Anyway, art@ fixed it. And I guess if I ran fsck on a heavily loaded fs now with his fix, fsck would run much faster. And with your fixes it will be blindingly fast. Any help needed in testing, I am willing to do it, so that it would help get the
Re: horribly slow fsck_ffs pass1 performance
On Wed, Mar 30, 2011 at 03:45:02PM -0500, Amit Kulkarni wrote:

> Hi, in fsck_ffs's pass1.c it just takes forever for large-sized partitions, and also if you have a very high number of files stored on that partition (the used inode count goes high). fsck's main limitation is in pass1.c. In pass1.c I found out that it in fact proceeded to check all inodes, but there's a misleading comment there, which says, "Find all allocated blocks." So the original intent was to check only used inodes in that code block, but somebody deleted that part of the code when compared to FreeBSD. Is there any special reason not to build a used-inode list, then only go through it as FreeBSD does? I know they added some stuff in the last year, but that part of the code has existed for a long time and we don't have it. Why not? I was reading cvs rev 1.46 of pass1.c in FreeBSD. Thanks

AFAIK, we never had that optimization. It is interesting because it really speeds up fsck_ffs for filesystems with few used inodes. There's also a dangerous part: it assumes the cylinder group summary info is ok when softdeps have been used. I suppose that's the reason why it was never included into OpenBSD. I'll ponder if I want to work on this.

-Otto
Re: horribly slow fsck_ffs pass1 performance
On 2011-03-31, Otto Moerbeek o...@drijf.net wrote:

> On Wed, Mar 30, 2011 at 03:45:02PM -0500, Amit Kulkarni wrote:
> > In fsck_ffs's pass1.c it just takes forever for large sized partitions and also if you have very high number of files stored on that partition (used inodes count goes high).

If you really have a lot of used inodes, skipping the unused ones isn't going to help :-)

You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses, e.g. storing backups as huge single files, it might be appropriate to go even higher). Of course this does involve dump/restore if you need to do this for an existing filesystem.

> It is interesting because it really speeds up fsck_ffs for filesystems with few used inodes. There's also a dangerous part: it assumes the cylinder group summary info is ok when softdeps has been used. I suppose that's the reason why it was never included into OpenBSD. I'll ponder if I want to work on this.

A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density.
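To put rough numbers on that advice: newfs creates on the order of one inode per -i bytes of data space, so for, say, a 40 GB partition the densities mentioned above work out as below. This is ballpark shell arithmetic only; real counts depend on the cylinder-group layout.

```shell
# Approximate inode counts for a 40 GB filesystem at different
# newfs -i (bytes-per-inode) settings.  Fewer inodes means less
# inode metadata for fsck's pass 1 to walk.
size=$((40 * 1024 * 1024 * 1024))
for density in 8192 32768 65536; do
    printf '%6d bytes/inode -> ~%d inodes\n' "$density" $((size / density))
done
```

Going from the historical 8192-byte default to -i 65536 cuts the pass-1 workload by roughly a factor of eight, at the price of capping how many small files the filesystem can ever hold.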
Re: horribly slow fsck_ffs pass1 performance
On 2011-03-31 11.13, Stuart Henderson wrote:

> You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses, e.g. storing backups as huge single files, it might be appropriate to go even higher). Of course this does involve dump/restore if you need to do this for an existing filesystem. [...]
>
> A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density.

I think this is a very good idea regardless. I often forget to manually tune large file systems, and end up with some ridiculously skewed resource allocations. For example, this is what one of my file systems looks like right now:

skynet:~# df -ih /u0
Filesystem     Size    Used   Avail  Capacity   iused       ifree  %iused  Mounted on
/dev/raid1a   12.6T    7.0T    5.5T    56%     881220   211866810     0%   /u0

This one takes about an hour to fsck.

In general, the default values and algorithms for allocations could probably do with a tune-up, since of course today's disks are several magnitudes larger than only a few years ago (let alone those that were around when the bulk of the file system code was written!), and the usage patterns are also, in my experience, often wildly different in a large file system than in a smaller one.

I guess an fs like the one above would benefit a lot from the optimization the OP mentions. Perhaps it could be optional, since Otto mentions that it makes assumptions on the correctness of the cylinder group summary info. I haven't looked at the code in a while, so I can't really judge the consequences of that, or whether some middle ground can be reached where the CG info is sanity-checked without the need for a full scan through every inode.

Regards,
/Benny
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 09:13:41AM +, Stuart Henderson wrote:

> You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses, e.g. storing backups as huge single files, it might be appropriate to go even higher).

disklabel already has code to move to larger block and frag sizes for large (new) partitions. newfs picks these settings up.

> Of course this does involve dump/restore if you need to do this for an existing filesystem. [...]
>
> A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density.

-Otto
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 12:30:29PM +0200, Benny Lofgren wrote: On 2011-03-31 11.13, Stuart Henderson wrote: On 2011-03-31, Otto Moerbeek o...@drijf.net wrote: On Wed, Mar 30, 2011 at 03:45:02PM -0500, Amit Kulkarni wrote: In fsck_ffs's pass1.c it just takes forever for large sized partitions and also if you have very high number of files stored on that partition (used inodes count goes high). If you really have a lot of used inodes, skipping the unused ones isn't going to help :-) You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses e.g. storing backups as huge single files it might be appropriate to go even higher). Of course this does involve dump/restore if you need to do this for an existing filesystem. It is interesting because it really speeds up fsck_ffs for filesystems with few used inodes. There's also a dangerous part: it assumes the cylinder group summary info is ok when softdeps has been used. I suppose that's the reason why it was never included into OpenBSD. I'll ponder if I want to work on this. A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density. I think this is a very good idea regardless. I often forget to manually tune large file systems, and end up with some ridiculously skewed resource allocations. For example, this is what one of my file systems looks like right now: skynet:~# df -ih /u0 Filesystem SizeUsed Avail Capacity iused ifree %iused Mounted on /dev/raid1a 12.6T7.0T5.5T56% 881220 211866810 0% /u0 This one takes about an hour to fsck. 
In general, the default values and algorithms for allocations could probably do with a tune-up, since of course today's disks are several magnitudes larger than only a few years ago (let alone than those that were around when the bulk of the file system code was written!), and the usage patterns are also in my experience often wildly different in a large file system than in a smaller one. We do that already, inode density will be lower for newly created partitions, because disklabel sets larger block and fragment sizes. -Otto I guess an fs like the one above would benefit a lot from the optimization the OP mentions. Perhaps it could be optional, since Otto mentions that it makes assumptions on correctness of the cylinder group summary info. I haven't looked at the code in a while, so I can't really judge the consequences of that, or if some middle ground can be reached where the CG info is sanity checked without the need for a full scan through every inode. Regards, /Benny

--
internetlabbet.se / work:   +46 8 551 124 80  /  Words must
Benny Lofgren     / mobile: +46 70 718 11 90  /  be weighed,
                  / fax:    +46 8 551 124 89  /  not counted.
                  / email:  benny -at- internetlabbet.se
Re: horribly slow fsck_ffs pass1 performance
On 2011/03/31 12:46, Otto Moerbeek wrote: In general, the default values and algorithms for allocations could probably do with a tune-up, since of course today's disks are several magnitudes larger than only a few years ago (let alone than those that were around when the bulk of the file system code was written!), and the usage patterns are also in my experience often wildly different in a large file system than in a smaller one. We do that already, inode density will be lower for newly created partitions, because disklabel sets larger block and fragment sizes. Ah, the manual is out-of-date.

Index: newfs.8
===================================================================
RCS file: /cvs/src/sbin/newfs/newfs.8,v
retrieving revision 1.68
diff -u -p -r1.68 newfs.8
--- newfs.8	21 Mar 2010 07:51:23 -0000	1.68
+++ newfs.8	31 Mar 2011 11:10:18 -0000
@@ -169,7 +169,7 @@ The expected average file size for the f
 The expected average number of files per directory on the file system.
 .It Fl i Ar bytes
 This specifies the density of inodes in the file system.
-The default is to create an inode for each 8192 bytes of data space.
+The default is to create an inode for every 4 fragments.
 If fewer inodes are desired, a larger number should be used;
 to create more inodes a smaller number should be given.
 .It Fl m Ar free-space
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 12:30:29PM +0200, Benny Lofgren wrote: For example, this is what one of my file systems looks like right now:

skynet:~# df -ih /u0
Filesystem     Size    Used   Avail Capacity  iused     ifree  %iused  Mounted on
/dev/raid1a   12.6T    7.0T    5.5T    56%   881220 211866810     0%   /u0

This one takes about an hour to fsck. The change discussed won't help you much here, since ffs2 filesystems already initialize only the inode blocks actually used. Memory use will be reduced, however, which might be even more worthwhile. -Otto
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 09:13:41AM +, Stuart Henderson wrote: On 2011-03-31, Otto Moerbeek o...@drijf.net wrote: On Wed, Mar 30, 2011 at 03:45:02PM -0500, Amit Kulkarni wrote: In fsck_ffs's pass1.c it just takes forever for large sized partitions and also if you have very high number of files stored on that partition (used inodes count goes high). If you really have a lot of used inodes, skipping the unused ones isn't going to help :-) You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses e.g. storing backups as huge single files it might be appropriate to go even higher). So this helps a lot to reduce fsck time; however, if you play a lot with the tuning parameters, the only thing you tune is less speed. I played quite a bit with the parameters and the results were always worse than the defaults. Of course this does involve dump/restore if you need to do this for an existing filesystem. It is interesting because it really speeds up fsck_ffs for filesystems with few used inodes. There's also a dangerous part: it assumes the cylinder group summary info is ok when softdeps has been used. I suppose that's the reason why it was never included into OpenBSD. I'll ponder if I want to work on this. A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density.
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 02:50:36PM -0500, Amit Kulkarni wrote: If you really have a lot of used inodes, skipping the unused ones isn't going to help :-) You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses e.g. storing backups as huge single files it might be appropriate to go even higher). Stuart, Thanks for the tip. But I can verify when I did lookup my 80G filesystem it is currently not specifying -i, so it is 8Kb per a single inode (it is 4 times frag size per your update to newfs man page). This is a no brainer optimization which can get huge wins in fsck immediately without too much change in the existing code. I don't think we want to change the default density. Larger partitions already get larger blocks and fragments, and as a consequence a lower number of inodes. Otto, In my tests on AMD64, if FFS partition size increases beyond 30GB, fsck starts taking exponential time even if you have zero used inodes. This is a for i () for j() loop and if you reduce the for j() inner loop it is a win. Yes, it becomes very slow, but I don't think it is exponential. dumpfs -m /downloads # newfs command for /dev/wd0o newfs -O 1 -b 16384 -e 4096 -f 2048 -g 16384 -h 64 -m 5 -o time -s 172714816 /dev/wd0o So, if I read it correctly, setting just the block size higher to say 64Kb does auto tune frag size to 1/8 which is 8Kb (newfs complains appropriately) but the auto tune inode length to 4 times frag which is 32Kb is not implemented now? Is this the proposed formula? There's no such thing as inode length. If a user tunes -i inodes, or -f frags or -b block size, it should all auto-adjust to the same outcome based on above formula in the future? I don't see any formula. 
If you feel you have too many inodes, you can use a larger -i, -b and/or -f. For newly created partitions, newfs will pick up larger -b and -f from the disklabel entry. If you still want fewer inodes, increase -f, -b or -i further. dumpfs doesn't show the total inodes or the inode length in an easily readable format (-m option). Just trying to understand what the acronyms mean. You want total inodes = ncg * ipg (number of cylinder groups * inodes per group) in the dumpfs header. I have no idea what you mean by inode length. -Otto
Re: horribly slow fsck_ffs pass1 performance
So here's an initial, only lightly tested diff. Beware, this very well could eat your filesystems. To note any difference, you should use the -p mode of fsck_ffs (rc does that) and the fs should have been mounted with softdep. I have seen very nice speedups already. -Otto

Index: dir.c
===================================================================
RCS file: /cvs/src/sbin/fsck_ffs/dir.c,v
retrieving revision 1.24
diff -u -p -r1.24 dir.c
--- dir.c	27 Oct 2009 23:59:32 -0000	1.24
+++ dir.c	31 Mar 2011 08:30:36 -0000
@@ -443,8 +443,8 @@ linkup(ino_t orphan, ino_t parentdir)
 		idesc.id_type = ADDR;
 		idesc.id_func = pass4check;
 		idesc.id_number = oldlfdir;
-		adjust(&idesc, lncntp[oldlfdir] + 1);
-		lncntp[oldlfdir] = 0;
+		adjust(&idesc, ILNCOUNT(oldlfdir) + 1);
+		ILNCOUNT(oldlfdir) = 0;
 		dp = ginode(lfdir);
 	}
 	if (GET_ISTATE(lfdir) != DFOUND) {
@@ -457,7 +457,7 @@ linkup(ino_t orphan, ino_t parentdir)
 		printf("\n\n");
 		return (0);
 	}
-	lncntp[orphan]--;
+	ILNCOUNT(orphan)--;
 	if (lostdir) {
 		if ((changeino(orphan, "..", lfdir) & ALTERED) == 0 &&
 		    parentdir != (ino_t)-1)
@@ -465,7 +465,7 @@ linkup(ino_t orphan, ino_t parentdir)
 		dp = ginode(lfdir);
 		DIP_SET(dp, di_nlink, DIP(dp, di_nlink) + 1);
 		inodirty();
-		lncntp[lfdir]++;
+		ILNCOUNT(lfdir)++;
 		pwarn("DIR I=%u CONNECTED. ", orphan);
 		if (parentdir != (ino_t)-1) {
 			printf("PARENT WAS I=%u\n", parentdir);
@@ -476,7 +476,7 @@ linkup(ino_t orphan, ino_t parentdir)
 		 * fixes the parent link count so that fsck does
 		 * not need to be rerun.
 		 */
-		lncntp[parentdir]++;
+		ILNCOUNT(parentdir)++;
 	}
 	if (preen == 0)
 		printf("\n");
@@ -636,7 +636,7 @@ allocdir(ino_t parent, ino_t request, in
 	DIP_SET(dp, di_nlink, 2);
 	inodirty();
 	if (ino == ROOTINO) {
-		lncntp[ino] = DIP(dp, di_nlink);
+		ILNCOUNT(ino) = DIP(dp, di_nlink);
 		cacheino(dp, ino);
 		return(ino);
 	}
@@ -650,8 +650,8 @@ allocdir(ino_t parent, ino_t request, in
 	inp->i_dotdot = parent;
 	SET_ISTATE(ino, GET_ISTATE(parent));
 	if (GET_ISTATE(ino) == DSTATE) {
-		lncntp[ino] = DIP(dp, di_nlink);
-		lncntp[parent]++;
+		ILNCOUNT(ino) = DIP(dp, di_nlink);
+		ILNCOUNT(parent)++;
 	}
 	dp = ginode(parent);
 	DIP_SET(dp, di_nlink, DIP(dp, di_nlink) + 1);
Index: extern.h
===================================================================
RCS file: /cvs/src/sbin/fsck_ffs/extern.h,v
retrieving revision 1.10
diff -u -p -r1.10 extern.h
--- extern.h	25 Jun 2007 19:59:55 -0000	1.10
+++ extern.h	31 Mar 2011 11:56:53 -0000
@@ -54,6 +54,7 @@ int	ftypeok(union dinode *);
 void	getpathname(char *, size_t, ino_t, ino_t);
 void	inocleanup(void);
 void	inodirty(void);
+struct inostat *inoinfo(ino_t);
 int	linkup(ino_t, ino_t);
 int	makeentry(ino_t, ino_t, char *);
 void	pass1(void);
Index: fsck.h
===================================================================
RCS file: /cvs/src/sbin/fsck_ffs/fsck.h,v
retrieving revision 1.23
diff -u -p -r1.23 fsck.h
--- fsck.h	10 Jun 2008 23:10:29 -0000	1.23
+++ fsck.h	31 Mar 2011 11:55:42 -0000
@@ -66,6 +66,19 @@ union dinode {
 #define BUFSIZ 1024
 #endif
 
+/*
+ * Each inode on the file system is described by the following structure.
+ * The linkcnt is initially set to the value in the inode. Each time it
+ * is found during the descent in passes 2, 3, and 4 the count is
+ * decremented. Any inodes whose count is non-zero after pass 4 needs to
+ * have its link count adjusted by the value remaining in ino_linkcnt.
+ */
+struct inostat {
+	char	ino_state;	/* state of inode, see below */
+	char	ino_type;	/* type of inode */
+	short	ino_linkcnt;	/* number of links not found */
+};
+
 #define	USTATE	01		/* inode not allocated */
 #define	FSTATE	02		/* inode is file */
 #define	DSTATE	03		/* inode is directory */
@@ -73,12 +86,20 @@ union dinode {
 #define	DCLEAR	05		/* directory is to be cleared */
 #define	FCLEAR	06		/* file is to be cleared */
 
-#define GET_ISTATE(ino)	(stmap[(ino)] & 0xf)
-#define GET_ITYPE(ino)	(stmap[(ino)] >> 4)
-#define SET_ISTATE(ino, v)	do { stmap[(ino)] = (stmap[(ino)] & 0xf0) | \
-				    ((v) & 0xf); } while (0)
-#define SET_ITYPE(ino, v)	do {
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 10:14:46PM +0200, Otto Moerbeek wrote: So here's an initial, only lightly tested diff. Beware, this very well could eat your filesystems. To note any difference, you should use the -p mode of fsck_ffs (rc does that) and the fs should have been mounted with softdep. I have seen very nice speedups already. But don't count yourself a rich man too soon: for ffs2 filesystems, you won't see a lot of speedup, because inode blocks are allocated on-demand there, so a filesystem with few inodes used likely has few inode blocks. Also, depending on the usage patterns, you might have a fs where high-numbered inodes are used, while the fs itself is pretty empty. Filling up a fs with lots of files and then removing a lot of them is an example that could lead to such a situation. This diff does not speed things up in such cases. -Otto
Re: horribly slow fsck_ffs pass1 performance
On Thu, Mar 31, 2011 at 10:12:07PM +0200, Otto Moerbeek wrote: So, if I read it correctly, setting just the block size higher to say 64Kb does auto tune frag size to 1/8 which is 8Kb (newfs complains appropriately) but the auto tune inode length to 4 times frag which is 32Kb is not implemented now? Is this the proposed formula? There's no such thing as inode length. If a user tunes -i inodes, or -f frags or -b block size, it should all auto-adjust to the same outcome based on above formula in the future? I don't see any formula. Ah, now I understand what you mean by formula. The rule is: if no -i parameter is given, its value is computed as 4 * fragment size. Default values for -b and -f are taken from the disklabel. disklabel(8) in -E mode fills them in based on fs partition size. If you specify -f or -b with newfs, these values override the values in the label, and the label will be updated after the newfs. So the next time you do a newfs, you'll re-use the last values for -b and -f. -Otto
Re: horribly slow fsck_ffs pass1 performance
If you really have a lot of used inodes, skipping the unused ones isn't going to help :-) You could always build your large-sized filesystems with a larger value of bytes-per-inode. newfs -i 32768 or 65536 is good for common filesystem use patterns with larger partitions (for specialist uses e.g. storing backups as huge single files it might be appropriate to go even higher). Stuart, Thanks for the tip. But I can verify when I did lookup my 80G filesystem it is currently not specifying -i, so it is 8Kb per a single inode (it is 4 times frag size per your update to newfs man page). This is a no brainer optimization which can get huge wins in fsck immediately without too much change in the existing code. Otto, In my tests on AMD64, if FFS partition size increases beyond 30GB, fsck starts taking exponential time even if you have zero used inodes. This is a for i () for j() loop and if you reduce the for j() inner loop it is a win. dumpfs -m /downloads # newfs command for /dev/wd0o newfs -O 1 -b 16384 -e 4096 -f 2048 -g 16384 -h 64 -m 5 -o time -s 172714816 /dev/wd0o So, if I read it correctly, setting just the block size higher to say 64Kb does auto tune frag size to 1/8 which is 8Kb (newfs complains appropriately) but the auto tune inode length to 4 times frag which is 32Kb is not implemented now? Is this the proposed formula? If a user tunes -i inodes, or -f frags or -b block size, it should all auto-adjust to the same outcome based on above formula in the future? dumpfs doesn't show the total inodes or the inode length in a easily readable format (-m option). Just trying to understand what the acronyms mean. Thanks disklabel has code already to move to larger block and frag sizes for large (new) partitions. newfs picks these settings up. Of course this does involve dump/restore if you need to do this for an existing filesystem. It is interesting because it really speeds up fsck_ffs for filesystems with few used inodes. 
There's also a dangerous part: it assumes the cylinder group summary info is ok when softdeps has been used. I suppose that's the reason why it was never included into OpenBSD. I'll ponder if I want to work on this. A safer alternative to this optimization might be for the installer (or newfs) to consider the fs size when deciding on a default inode density. -Otto
Re: horribly slow fsck_ffs pass1 performance
I don't think we want to change the default density. Larger partitions already get larger blocks and fragments, and as a consequence a lower number of inodes. Otto, In my tests on AMD64, if FFS partition size increases beyond 30GB, fsck starts taking exponential time even if you have zero used inodes. This is a for i () for j() loop and if you reduce the for j() inner loop it is a win. Yes, it becomes very slow, but I don't think it is exponential. Wow, even with ***existing code*** because I did a newfs -b 65536 -f 8192 wd0m (this has an implicit -i 32768) fsck chewed through an 80G partition with 2 clang static analyzer runs (2100 files of 200 Kb each) within 1 minute. Before this, it never got past pass1 in over 5 hours. Insanely fast fsck runs. Thanks Stuart and Otto. Why don't you make this the newfs default? What does everybody say? newfs -b 65536 -f 8192 -i 32768 Somebody ought to change the section in the FAQ too! I will try out your diff right now. dumpfs -m /downloads # newfs command for /dev/wd0o newfs -O 1 -b 16384 -e 4096 -f 2048 -g 16384 -h 64 -m 5 -o time -s 172714816 /dev/wd0o So, if I read it correctly, setting just the block size higher to say 64Kb does auto tune frag size to 1/8 which is 8Kb (newfs complains appropriately) but the auto tune inode length to 4 times frag which is 32Kb is not implemented now? Is this the proposed formula? There's no such thing as inode length. Sorry, what I meant was the size required to consider storing a single inode?
horribly slow fsck_ffs pass1 performance
Hi, In fsck_ffs's pass1.c it just takes forever for large-sized partitions and also if you have a very high number of files stored on that partition (used inodes count goes high). fsck's main limitation is in pass1.c. In pass1.c I found out that it in fact proceeds to check all inodes, but there's a misleading comment there, which says, "Find all allocated blocks." So the original intent was to check only used inodes in that code block, but somebody deleted that part of the code when compared to FreeBSD. Is there any special reason not to build a used inode list, then only go through it as FreeBSD does? I know they added some stuff in the last year but that part of the code has existed for a long time and we don't have it. Why not? I was reading CVS revision 1.46 of pass1.c in FreeBSD. Thanks