http://thunk.org/tytso/blog/2009/02/26/fast-ext4-fsck-times-revisited/

Fast ext4 fsck times, revisited

Last night I managed to finish up a rather satisfying improvement to ext4's inode and block allocators. Ext4's original allocator was actually a bit more simple-minded than ext3's, in that it didn't implement the Orlov algorithm, which spreads top-level directories across the file system so that it ages better. It was also buggy in certain ways; for example, it could return ENOSPC even when there were still plenty of inodes left in the file system.

So I had been working on extending ext3's original Orlov allocator so it would work well with ext4. While I was at it, it occurred to me that one of the tricks I could play with ext4's flex groups (which are higher-order collections of block groups) was to bias the block allocation algorithms so that the first block group in a flex group would be preferred for directory blocks and avoided for the data blocks of regular files. This meant that directory blocks would get clustered together, which cut e2fsck's pass 2 time to roughly a third of what it had been:

Comparison of e2fsck times on a 32GB partition
      |           ext4 old allocator           |           ext4 new allocator
      |       time (s)              I/O        |       time (s)              I/O
Pass  |   real   user  system  MB read   MB/s  |   real   user  system  MB read   MB/s
1     |   6.69   4.06    0.90       82  12.25  |   6.70   3.63    1.58       82  12.23
2     |  13.34   2.30    3.78      133   9.97  |   4.24   1.27    2.46      133  31.36
3     |   0.02   0.01       0        1  63.85  |   0.01   0.01    0.01        1  82.69
4     |   0.28   0.27       0        0      0  |   0.23   0.22       0        0      0
5     |   2.60   2.31    0.03        1   0.38  |   2.42   2.15    0.07        1   0.41
Total |  23.06   9.03    4.74      216   9.37  |  13.78   7.33    4.19      216  15.68

As you may recall from my previous observations on this blog, even though we hadn't been explicitly engineering for it, a file system consistency check on an ext4 file system tends to be a factor of 6-8 faster than on an equivalent ext3 file system, mainly because the elimination of indirect blocks and the uninit_bg feature greatly reduce the amount of disk reading e2fsck has to do in pass 1. However, those layout optimizations didn't do much for e2fsck's pass 2. The new block and inode allocators are complementary to the original ext4 fsck improvements, since they focus on what we hadn't optimized the first time around: e2fsck's pass 2 time has dropped to roughly a third of what it was, and the overall fsck time has been cut by 40%. Not too shabby!
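
To make the idea concrete, here is a toy sketch of the bias I described above. This is purely illustrative, not the actual ext4 allocator code (the real logic lives in the kernel's block and inode allocators and is considerably more involved), and all of the constants and names are made up for the example:

/*
 * Toy illustration of the flex-group bias: within one flex group,
 * directory blocks are steered toward the first block group, and
 * regular-file data blocks are steered away from it.  Not the real
 * ext4 code; the group and block counts are invented for the demo.
 */
#include <stdio.h>
#include <stdbool.h>

#define GROUPS_PER_FLEX   4
#define BLOCKS_PER_GROUP  8192

static int free_blocks[GROUPS_PER_FLEX] = {
    BLOCKS_PER_GROUP, BLOCKS_PER_GROUP, BLOCKS_PER_GROUP, BLOCKS_PER_GROUP
};

/* Pick a block group within the flex group for a single block. */
static int pick_group(bool for_directory)
{
    int g;

    if (for_directory) {
        /* Directory blocks: cluster them in the first group. */
        if (free_blocks[0] > 0)
            return 0;
    } else {
        /* Regular-file data: avoid the first group if possible. */
        for (g = 1; g < GROUPS_PER_FLEX; g++)
            if (free_blocks[g] > 0)
                return g;
    }
    /* Fall back to any group that still has room. */
    for (g = 0; g < GROUPS_PER_FLEX; g++)
        if (free_blocks[g] > 0)
            return g;
    return -1;   /* the flex group is completely full */
}

int main(void)
{
    int dir_in_g0 = 0, data_in_g0 = 0, i, g;

    /* Allocate 1,000 directory blocks and 10,000 data blocks. */
    for (i = 0; i < 1000; i++) {
        g = pick_group(true);
        if (g == 0)
            dir_in_g0++;
        if (g >= 0)
            free_blocks[g]--;
    }
    for (i = 0; i < 10000; i++) {
        g = pick_group(false);
        if (g == 0)
            data_in_g0++;
        if (g >= 0)
            free_blocks[g]--;
    }

    printf("directory blocks placed in group 0: %d of 1000\n", dir_in_g0);
    printf("data blocks placed in group 0:      %d of 10000\n", data_in_g0);
    return 0;
}

With this kind of policy the directory blocks end up packed into a small, contiguous region of each flex group, which is exactly what lets e2fsck's pass 2 read them with far fewer seeks.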

Of course, we need to do more testing to make sure we haven't caused other file system benchmarks to regress, although I'm cautiously optimistic that this will end up being a net win. I suspect some benchmarks will improve a little and others will degrade a little, depending on how heavily they exercise directory operations versus sequential I/O. If people want to test the new allocator, it is in the ext4 patch queue. If all goes well, I hope to push it to Linus at the next merge window, after 2.6.29 is released.


For comparison's sake, here is the fsck time for the same collection of files and directories on ext3 versus ext4 with its original block and inode allocators. The file system in question is a 32GB install of Ubuntu Jaunty, with a personal home directory, a rather large Maildir directory, some Linux kernel trees, and an e2fsprogs tree. It's basically the emergency environment I set up on my netbook at FOSDEM.

In all cases the file systems were freshly populated from the original root file system using the command rsync -axH / /mnt. It's actually a bit surprising to me that ext3's pass 2 time was that much better than the pass 2 time under the old ext4 allocator. My previous experience has been that the two are normally about the same, with a throughput of around 9-10 MB/s during e2fsck's pass 2 for both ext3 file systems and ext4 file systems using the original inode/block allocators. Hence, I would have expected ext3's pass 2 time to be 12-13 seconds, not 6; perhaps it was just the luck of the draw in terms of how things ended up getting laid out on disk. In any case, overall things look quite good for ext4 fsck times!

Comparison of e2fsck times on a 32GB partition
      |                  ext3                  |           ext4 old allocator
      |       time (s)              I/O        |       time (s)              I/O
Pass  |   real   user  system  MB read   MB/s  |   real   user  system  MB read   MB/s
1     | 108.40  13.74   11.53      583   5.38  |   6.69   4.06    0.90       82  12.25
2     |   5.91   1.74    2.56      133  22.51  |  13.34   2.30    3.78      133   9.97
3     |   0.03   0.01       0        1  31.21  |   0.02   0.01       0        1  63.85
4     |   0.28   0.27       0        0      0  |   0.28   0.27       0        0      0
5     |   3.17   0.92    0.13        2   0.63  |   2.60   2.31    0.03        1   0.38
Total | 118.15  16.75   14.25      718   6.08  |  23.06   9.03    4.74      216   9.37
Vital Statistics of the 32GB partition
  312214 inodes used (14.89%)
     263 non-contiguous files (0.1%)
     198 non-contiguous directories (0.1%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 292698/40
 4388697 blocks used (52.32%)
       0 bad blocks
       1 large file

  263549 regular files
   28022 directories
       5 character device files
       1 block device file
       5 fifos
     615 links
   20618 symbolic links (19450 fast symbolic links)
       5 sockets
  312820 files



6 Responses to “Fast ext4 fsck times, revisited”

  1. # 1 GoblinX Project » GoblinX Newsletter, Issue 189 (03/01/2009) Says:

    [...] Fast ext4 fsck times, revisited [...]

  2. # 2 Ionut Says:

    I just read your post and I want to congratulate you on your great work.
    I do have a little question, and I should mention that I'm pretty new to these things:
    if this new allocator is merged soon and I use ext4 on my partitions, then after I upgrade my kernel do I have to do anything to use the new allocator instead of the old one?

  3. # 3 tytso Says:

    @2: Ionut,

    Unless some major problems are found with it, it will probably be merged at the next merge window (i.e., after 2.6.29 is released). At this point my plans are to make it the default allocator, so no, you won’t have to do anything special once you are booting a kernel that has the new allocator merged.

    Of course, to get the most value out of the allocator, you’ll need to do a backup/reformat/restore pass, so that the directory blocks are concentrated together, etc. But it shouldn’t do any harm to use the new allocator on an existing ext4 or ext3 filesystem; you just won’t see all of the benefits of the new allocator.

  4. # 4 Anon Says:

    (OFFTOPIC)

    Do you reckon you have time to bash out a “safe file writing article”? It looks like https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781 is getting a lot of traffic…

  5. # 5 Anon Says:

    (Darn it I was too slow in my request and the cloud of uncertainty has been spread far and wide. Sorry Ted. Feel free to delete this and the previous comment)

  6. # 6 VP Says:

    For what it’s worth, JkDefrag[1], a defragmenter for Windows, groups all directories at the start of the disk. Apparently, the start of the disk is (a bit to significantly) faster than the end, and since directories are by far the most accessed files, it makes sense to put ‘em there. I suppose a similar approach might be a good idea for ext4, if you’re going to group ‘em all anyway.

    [1]http://www.kessels.com/Jkdefrag/

