http://lwn.net/Articles/317787/


Online defragmentation for ext4


By Jonathan Corbet
February 4, 2009
Any filesystem designed for use with rotating media must pay careful attention to the layout of files on the disk. If a file's blocks can be placed sequentially on the device, they can be read or written as a unit, without the need for performance-destroying head seeks in the middle. Even the most careful filesystem will sometimes fail to lay out files in a minimal number of contiguous extents, though. If a file grows, for example, and the blocks just past the previous end are not available, the filesystem has no choice other than placing the new blocks somewhere else. Depending on how full the filesystem is, those blocks could end up far away indeed. This sort of fragmentation can result in filesystems slowing down over time.

Fragmentation problems can be fixed up after the fact. The most obvious way to defragment a disk is to make a new filesystem on it; after all, empty filesystems tend not to have fragmentation problems. But the new filesystem will have less fragmentation even after its old contents have been restored onto it. When the ultimate size of every file is known in advance, it's relatively easy to make good layout decisions. Knowing this, system administrators have used backup-and-restore cycles as a way of cleaning up overly fragmented disks for many years.

There is, of course, a problem with this approach which goes beyond the risk of discovering that one's backup is not quite as good as one had thought. The downtime associated with rewriting a disk can be unwelcome to users; a filesystem which is down responds even more slowly than a filesystem with fragmentation problems. So it would be nice to have a way to defragment a filesystem while keeping it online and available. This online defragmentation capability has been on the ext4 "planned features" list for a long time; it is, at this point, about the only planned feature which has not yet been merged into the mainline.

Some attempts at online defragmentation have been made in the past, but they have not yet made it through review. Now Akira Fujita has come forward with a new ext4 online defragmentation patch which, by virtue of taking a different view of the problem, might just make it into the mainline. Previous attempts exposed an interface whereby a user-space application could ask the filesystem to defragment a specific file by allocating new (contiguous) blocks to it. That turned out to be a bit too much work to put into the kernel; so, with this patch, Akira has created an interface which moves a bit more of the work into user space.

In the new scheme, a user-space defragmentation daemon will pick a file which, in its opinion, is too spread out on the disk. The daemon will then set about creating a new, less-fragmented file to replace it. That is done by creating a new, temporary file on the same filesystem, then unlinking it (while holding the file descriptor open). Calls to fallocate() can then be used to add the requisite number of blocks to the new file. Once the new file is up to the correct size, the daemon can use the FS_IOC_FIEMAP ioctl() to query the number of extents (fragments) it contains. If the new file is not an improvement over the old one, the daemon should just close it and give up; the filesystem simply does not have enough contiguous storage available.
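
Those steps map onto a handful of ordinary system calls. What follows is a minimal sketch of how such a daemon might create the unlinked donor file and count its extents; the make_donor() helper, the temporary-file naming, and the error handling are this sketch's own inventions, not code from Akira's tool:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <linux/fs.h>        /* FS_IOC_FIEMAP */
    #include <linux/fiemap.h>    /* struct fiemap */

    /* Create an unlinked donor file of the given size on the target
     * filesystem; return its extent count (-1 on error) and pass the
     * open descriptor back through *fd_out. */
    static int make_donor(const char *dir, off_t size, int *fd_out)
    {
        char path[4096];
        snprintf(path, sizeof(path), "%s/.defragXXXXXX", dir);

        int fd = mkstemp(path);
        if (fd < 0)
            return -1;
        unlink(path);                    /* file lives on via the open fd */

        if (fallocate(fd, 0, 0, size)) { /* ask for contiguous blocks */
            close(fd);
            return -1;
        }

        /* With fm_extent_count left at zero, FS_IOC_FIEMAP only counts
         * the extents rather than returning their details. */
        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = size;
        if (ioctl(fd, FS_IOC_FIEMAP, &fm)) {
            close(fd);
            return -1;
        }

        *fd_out = fd;
        return (int)fm.fm_mapped_extents;
    }

The caller would compare the returned extent count against the original file's count (obtained the same way) and simply close the descriptor if the donor is no improvement.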

The daemon could, at this point, simply copy the old file into the new one, then put the newly defragmented version in the place of the old one. The problems with that approach include performance (all that data must be copied through user space) and robustness. If some other process changes the file while the copy is happening, the new file may lose those changes. Indeed, if some process has the old file open, it may never notice that the replacement has happened. So something smarter is needed.

Akira's patch addresses these problems with the creation of a new, magic ioctl() call for ext4. The defragmentation application must fill out a structure like:

    struct move_extent {
	int org_fd;		/* original file descriptor */
	int dest_fd;		/* destination file descriptor */
	ext4_lblk_t start;	/* logical offset of org_fd and dest_fd */
	ext4_lblk_t len;	/* exchange block length */
    };

This structure, when passed to the new EXT4_IOC_DEFRAG ioctl(), expresses a request to the kernel to move len blocks from the original file to the new one, starting at start. Essentially, it copies an extent's worth of data into the (fully allocated, nicely contiguous) space in the new file, then performs a magic block swap. The contiguous blocks from the new file are patched into the old file, while the fragmented blocks are, instead, put into the new file. Once the entire file has been treated in this way, the file will have been defragmented without having been visibly moved.
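
Driving the ioctl() from user space is then straightforward. The sketch below is illustrative only: EXT4_IOC_DEFRAG comes from the patched ext4 headers (it is not defined in mainline kernels), the struct is the one shown above, and the helper function is this sketch's own, though issuing the call on the original file's descriptor follows the patch's usage:

    #include <sys/ioctl.h>

    /* EXT4_IOC_DEFRAG and struct move_extent (shown above) come from
     * the patched ext4 headers; build against those. */
    static int swap_blocks(int orig_fd, int donor_fd,
                           ext4_lblk_t start, ext4_lblk_t len)
    {
        struct move_extent me = {
            .org_fd  = orig_fd,          /* fragmented file */
            .dest_fd = donor_fd,         /* freshly allocated donor */
            .start   = start,            /* starting logical block */
            .len     = len,              /* blocks to exchange */
        };
        return ioctl(orig_fd, EXT4_IOC_DEFRAG, &me);
    }

A defragmentation tool would call something like this in a loop, one chunk at a time, until the whole file had been exchanged.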

The final step is to delete the "new" file, which now contains the "old" file's blocks. Since the file had been unlinked, that will cause the filesystem to recover the old blocks and the task will be complete. For the curious, Akira has posted the source for a user-space defragmentation tool which shows how this interface can be used.

There have not been a whole lot of objections to the new code. Chris Mason did point out that the system will do unfortunate things if the layout of a swap file changes. He has clearly thought about the problem - to an extent:

Btrfs is currently getting around this by dropping bmap support, so swapfiles on btrfs won't work at all. A real long term solution is required ;)

Beyond that, there are some minor issues, such as the definition of the ABI in terms of types like int instead of architecture-independent types. Requests for separate source and destination block numbers have been made; that feature would help developers working on hierarchical storage systems. The ability to guide the allocation of blocks would be useful in situations where performance can be improved by grouping related files together on the disk.
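
To make those complaints concrete: a revised, architecture-independent version of the structure might look something like the sketch below. It is purely hypothetical (no such revision has been posted); it swaps the bare int types for the kernel's fixed-width types, so that 32- and 64-bit user space see the same layout, and adds the requested separate offsets:

    #include <linux/types.h>

    /* Hypothetical revision, for illustration only. */
    struct move_extent {
        __u32 org_fd;            /* original file descriptor */
        __u32 dest_fd;           /* donor file descriptor */
        __u64 org_start;         /* logical start block, original */
        __u64 dest_start;        /* logical start block, donor */
        __u64 len;               /* number of blocks to exchange */
    };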

There could also be value in finding a way to move much of this functionality into the VFS layer where it could be used with other filesystems as well; that could prove to be a difficult task, though, and ext4 maintainer Ted Ts'o has little desire to take on that job.

Those little issues notwithstanding, it does appear that the ext4 filesystem may be closer to getting the much-requested online defragmentation feature.



defragmentation of swap file?

Posted Feb 5, 2009 2:21 UTC (Thu) by brouhaha (subscriber, #1698)

While it's obviously possible for a swap file to become extremely fragmented, I've never had any serious problem with it. It would be fine by me if online defragmentation of the swap file simply wasn't allowed. Instead of building complicated mechanisms into filesystems to support that, and requiring every filesystem to use them, a relatively simple piece of kernel code could check whether the file in question is an active swap file and deny the request from user space.

If you really need to do it, the user space utility could create a replacement swap file, defragment it (if necessary), enable swapping on the new file, then disable swapping on the old file and delete it. That's not significantly worse than the way the defragmentation patch works with regard to normal (non-swap) files. The user-space program could determine whether the file in question is a swap file, and deal with it appropriately.
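
The kernel side of that check could indeed be tiny, since the kernel already marks inodes backing active swap. A hypothetical guard at the top of the defrag ioctl handler (the errno choice here is arbitrary) might read:

    /* Hypothetical guard: IS_SWAPFILE() tests the S_SWAPFILE flag,
     * which swapon() sets on an inode and swapoff() clears. */
    struct inode *inode = file->f_mapping->host;

    if (IS_SWAPFILE(inode))
            return -EBUSY;       /* refuse to move an active swapfile */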


defragmentation of swap file?

Posted Feb 5, 2009 12:43 UTC (Thu) by jengelh (subscriber, #33263)

>It would be fine by me if online defragmentation of the swap file wasn't allowed.

This, in fact, also holds true for FAT and NTFS. (It seems possible to get the pagefile moved by removing the system attribute from it, but who knows what side effects that has on a live swap...)

defragmentation of swap file?

Posted Feb 5, 2009 22:09 UTC (Thu) by yipyip (subscriber, #25108)

Indeed, and there's a utility for Windows called PageDefrag that will defragment the Windows pagefile and other critical system files early in the boot process, before the system is up and using them.

Online defragmentation for ext4

Posted Feb 5, 2009 3:40 UTC (Thu) by dtlin (subscriber, #36537)

XFS already has working online defragmentation, via xfs_fsr and ioctl(XFS_IOC_SWAPEXT). Has anybody mentioned why online defragmentation for ext4 can't just use the same interfaces?

Online defragmentation for ext4

Posted Feb 5, 2009 5:53 UTC (Thu) by dgc (subscriber, #6611)

The algorithm of creating a new file, preallocating it, and
comparing extents before doing any data movement is pretty much
the same as what xfs_fsr does. However, the two differ in their
data exchange methods.

xfs_fsr does all the data movement via direct IO in userspace
(i.e. it scales extremely well). It swaps the extents atomically
if the inode has not changed between the start of the data copy
and its completion. Hence you can't defragment active files.

This was considered a fundamental blocker for ext4 even though
most active files never need defragmentation (think shared
libraries). Hence the ext4 patchset implements data movement inside
ext4 itself and so the kernel defrag code is much, much more complex
than the XFS swap extents ioctl. Userspace complexity is about the
same, but different APIs were required for ext4 to do its "a bit at
a time" algorithm....
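
For comparison, the xfs_fsr pattern dgc describes reduces to "copy with direct IO, then swap extents only if nothing touched the file in the meantime." A rough sketch of that safety check follows; the helper is invented for illustration, and the XFS-specific swap ioctl and its argument structure are omitted:

    #include <stdbool.h>
    #include <sys/stat.h>

    /* After copying the data into the donor file with O_DIRECT reads
     * and writes, only ask for the atomic extent swap if the original
     * inode is unchanged since the copy started. */
    static bool unchanged_since(const struct stat *before, int fd)
    {
        struct stat now;

        if (fstat(fd, &now))
            return false;
        return now.st_mtime == before->st_mtime &&
               now.st_ctime == before->st_ctime &&
               now.st_size  == before->st_size;
    }

If the check fails, the tool abandons the donor file and tries again later; that is exactly why active files cannot be defragmented this way.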

the need isn't limited to rotating media

Posted Feb 5, 2009 7:17 UTC (Thu) by dlang (subscriber, #313)

On flash disks the access time differences are smaller, but there are still advantages to sequential reads/writes compared to random ones.

In addition, flash devices actually write in blocks much larger than the typical filesystem block size (block sizes are in the 1-4K range, eraseblocks are in the 128K to multi-megabyte range).

A worst-case scenario with current low-end systems would be overwriting an existing file on a filesystem with 1K blocks and 256K eraseblocks: a 1M file could take 4 writes to the flash if it is contiguous, or on the order of 1000 writes if it is fragmented down to single blocks.

Normally files don't get this fragmented, so you won't see that worst case in real life, but the grouping can really matter.

Another place where you could see a huge difference is on RAID 5/6 arrays. If your system checks parity on reads, a read of one block requires reading the entire stripe and verifying the parity data. If all your data is on that stripe you have it all; if not, you have to repeat the process.

For writes it can be even more drastic. To modify an existing stripe you need to read the existing block from the disk being changed and from the parity disk(s), then calculate the changes to the parity and write the changes out. If you know that you are overwriting the entire stripe, there is no need to read any old data first; just calculate the new parity and do the write. If you could do this (no system does this today), then streaming/large-file writes to RAID 5/6 would be just about as fast as to RAID 0 (some additional CPU time, plus the IO bandwidth of the parity disks, but unless the system is saturated these will not matter).
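
The parity arithmetic behind that read-modify-write cycle is plain XOR, which makes the cost difference easy to see in code; a small sketch (the types and names are invented for illustration):

    #include <stdint.h>

    /* RAID 5 small write: two reads (old data, old parity) and two
     * writes are needed just to update one block's parity. */
    static uint64_t rmw_parity(uint64_t old_parity,
                               uint64_t old_data, uint64_t new_data)
    {
        /* cancel the old data's contribution, add the new data's */
        return old_parity ^ old_data ^ new_data;
    }

    /* RAID 5 full-stripe write: no reads at all; the parity is
     * computed from scratch over the stripe's data blocks. */
    static uint64_t full_stripe_parity(const uint64_t *blocks, int n)
    {
        uint64_t p = 0;

        for (int i = 0; i < n; i++)
            p ^= blocks[i];
        return p;
    }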

Online defragmentation for ext4

Posted Feb 5, 2009 11:13 UTC (Thu) by mp (subscriber, #5615)

If I understand correctly, there is also one other reason, not mentioned clearly in the article, why the defragmentation daemon needs some support from kernel space: the "put the newly defragmented version in the place of the old one" step must be done by updating the original file's inode, or you would end up having to hunt down and update every directory entry pointing to the file.

Online defragmentation for ext4

Posted Feb 5, 2009 16:58 UTC (Thu) by nix (subscriber, #2304)

You'd also break everything that had the file open for writing if you left
it pointing at a deleted file.

Online defragmentation for ext4

Posted Feb 5, 2009 13:44 UTC (Thu) by brother_rat (subscriber, #1895)

This doesn't seem to help consolidate free space, which I'd have thought is a big part of limiting future fragmentation, and which would also make it easier to defragment existing large files.

