Online defragmentation for ext4
Any filesystem designed for use with rotating media must pay careful attention to the layout of files on the disk. If a file's blocks can be placed sequentially on the device, they can be read or written as a unit, without the need for performance-destroying head seeks in the middle. Even the most careful filesystem will sometimes fail to lay out files in a minimal number of contiguous extents, though. If a file grows, for example, and the blocks just past the previous end are not available, the filesystem has no choice other than placing the new blocks somewhere else. Depending on how full the filesystem is, those blocks could end up far away indeed. This sort of fragmentation can result in filesystems slowing down over time.
Fragmentation problems can be fixed up after the fact. The most obvious way to defragment a disk is to make a new filesystem on it; after all, empty filesystems tend not to have fragmentation problems. But the new filesystem will have less fragmentation even after its old contents have been restored onto it. When the ultimate size of every file is known in advance, it's relatively easy to make good layout decisions. Knowing this, system administrators have used backup-and-restore cycles as a way of cleaning up overly fragmented disks for many years.

There is, of course, a problem with this approach which goes beyond the risk of discovering that one's backup is not quite as good as one had thought. The downtime associated with rewriting a disk can be unwelcome to users; a filesystem which is down responds even more slowly than a filesystem with fragmentation problems. So it would be nice to have a way to defragment a filesystem while keeping it online and available.

This online defragmentation capability has been on the ext4 "planned features" list for a long time; it is, at this point, about the only planned feature which has not yet been merged into the mainline. Some attempts at online defragmentation have been made in the past, but they have not yet gotten through review. Now Akira Fujita has come forward with a new ext4 online defragmentation patch which, by virtue of a different view of the problem, might just make it into the mainline.

Previous attempts exposed an interface whereby a user-space application could ask the filesystem to defragment a specific file by allocating new (contiguous) blocks to it. That turned out to be a bit too much work to put into the kernel; so, with this patch, Akira has created an interface which moves a bit more of the work into user space.

In the new scheme, a user-space defragmentation daemon will pick a file which, in its opinion, is too spread out on the disk. The daemon will then set about creating a new, less-fragmented file to replace it. That is done by creating a new, temporary file on the same filesystem, then unlinking it (while holding the file descriptor open). Calls to fallocate() can then be used to add the requisite number of blocks to the new file. Once the new file is up to the correct size, the daemon can use the FS_IOC_FIEMAP ioctl() to query the number of extents (fragments) it contains. If the new file is not an improvement over the old one, the daemon should just close it and give up; the filesystem simply does not have enough contiguous storage available.

The daemon could, at this point, simply copy the old file into the new one, then put the newly defragmented version in the place of the old one. The problems with that approach include performance (all that data must be copied through user space) and robustness. If some other process changes the file while the copy is happening, the new file may lose those changes. Indeed, if some process has the old file open, it may never notice that the replacement has happened. So something smarter is needed.

Akira's patch addresses these problems with the creation of a new, magic ioctl() call for ext4. The defragmentation application must fill out a structure like:

struct move_extent {
        int org_fd;         /* original file descriptor */
        int dest_fd;        /* destination file descriptor */
        ext4_lblk_t start;  /* logical offset of org_fd and dest_fd */
        ext4_lblk_t len;    /* exchange block length */
};
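To make the sequence concrete, the pieces just described - an unlinked temporary donor file, fallocate(), a FIEMAP extent count, and the block-swapping ioctl() explained below - might be strung together like this. This is only a sketch: the "ext4_defrag.h" header, the chunk size, and the choice to issue the ioctl() on the original descriptor are assumptions; the authoritative details are in Akira's patch and example tool.

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
#include "ext4_defrag.h"  /* hypothetical: EXT4_IOC_DEFRAG, struct move_extent */

/* Count a file's extents with FS_IOC_FIEMAP.  With fm_extent_count
 * left at zero, the kernel returns no extent records and simply
 * reports how many there are in fm_mapped_extents. */
static int count_extents(int fd)
{
        struct fiemap fm;

        memset(&fm, 0, sizeof(fm));
        fm.fm_length = ~0ULL;           /* map the whole file */
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
                return -1;
        return fm.fm_mapped_extents;
}

/* One defragmentation pass over a single file of 'size' bytes. */
static int defrag_one(int org_fd, const char *dir, off_t size,
                      unsigned int blocksize)
{
        struct move_extent me;
        char path[PATH_MAX];

        /* Create the donor file on the same filesystem, and unlink it
         * at once so that only the descriptor keeps it alive. */
        snprintf(path, sizeof(path), "%s/.defrag.%d", dir, getpid());
        int donor = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (donor < 0)
                return -1;
        unlink(path);

        /* Allocate all of the donor's blocks up front... */
        if (fallocate(donor, 0, 0, size) < 0)
                goto fail;

        /* ...and give up if the layout is no better than the original's. */
        int donor_ext = count_extents(donor);
        int orig_ext = count_extents(org_fd);
        if (donor_ext < 0 || orig_ext < 0 || donor_ext >= orig_ext)
                goto fail;

        /* Have the kernel copy the data and swap the underlying blocks,
         * one chunk at a time; the chunk size here is arbitrary. */
        ext4_lblk_t nblocks = (size + blocksize - 1) / blocksize;
        for (ext4_lblk_t off = 0; off < nblocks; off += 16384) {
                me.org_fd = org_fd;
                me.dest_fd = donor;
                me.start = off;
                me.len = (nblocks - off > 16384) ? 16384 : nblocks - off;
                if (ioctl(org_fd, EXT4_IOC_DEFRAG, &me) < 0)
                        goto fail;
        }

        close(donor);   /* the donor now holds the old, fragmented
                         * blocks; closing it frees them */
        return 0;
fail:
        close(donor);
        return -1;
}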
The move_extent structure, when passed to the new EXT4_IOC_DEFRAG ioctl(), expresses a request to the kernel to move len blocks from the original file to the new one, starting at start. Essentially, it copies an extent's worth of data into the (fully allocated, nicely contiguous) space in the new file, then performs a magic block swap: the contiguous blocks from the new file are patched into the old file, while the fragmented blocks are, instead, put into the new file. Once the entire file has been treated in this way, it will have been defragmented without having been visibly moved.

The final step is to delete the "new" file, which now contains the "old" file's blocks. Since that file has already been unlinked, closing it will cause the filesystem to recover the old blocks, and the task will be complete. For the curious, Akira has posted the source for a user-space defragmentation tool which shows how this interface can be used.

There have not been a whole lot of objections to the new code. Chris Mason did point out that the system will do unfortunate things if the layout of a swap file changes. He has clearly thought about the problem - to an extent:

    Btrfs is currently getting around this by dropping bmap support, so swapfiles on btrfs won't work at all. A real long term solution is required ;)
Beyond that, there are some minor issues, such as the definition of the ABI in terms of types like int instead of architecture-independent types. Requests for separate source and destination block numbers have been made; that feature would help developers working on hierarchical storage systems. The ability to guide the allocation of blocks would be useful in situations where performance can be improved by grouping related files together on the disk.

There could also be value in finding a way to move much of this functionality into the VFS layer, where it could be used with other filesystems as well; that could prove to be a difficult task, though, and ext4 maintainer Ted Ts'o has little desire to take on that job. Those little issues notwithstanding, it does appear that the ext4 filesystem may be closer to getting the much-requested online defragmentation feature.
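For illustration only (this is not an actual revision of the patch), a version of the interface defined with the explicitly sized types from <linux/types.h> might look like:

struct move_extent {
        __s32 org_fd;   /* original file descriptor */
        __s32 dest_fd;  /* destination file descriptor */
        __u32 start;    /* logical offset of org_fd and dest_fd */
        __u32 len;      /* exchange block length */
};

Such types have the same size on every architecture, so a 32-bit defragmentation tool could pass the same structure to a 64-bit kernel without any compatibility shims.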
defragmentation of swap file?
Posted Feb 5, 2009 2:21 UTC (Thu) by brouhaha (subscriber, #1698)

While it's obviously possible, I've never had any serious problem with the swap file getting extremely fragmented. It would be fine by me if online defragmentation of the swap file wasn't allowed. Instead of building complicated mechanisms for file systems to support that, and requiring file systems to use it, a relatively simple piece of kernel code could check whether the file in question was an active swap file, and deny the request from user space.

If you really need to do it, the user space utility could create a replacement swap file, defragment it (if necessary), enable swapping on the new file, then disable swapping on the old file and delete it. That's not significantly worse than the way the defragmentation patch works with regard to normal (non-swap) files. The user-space program could determine whether the file in question is a swap file, and deal with it appropriately.
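The replacement-swap-file sequence suggested here could stay almost entirely in user space. A minimal sketch, assuming the new file has already been created, laid out contiguously, and formatted with mkswap, and ignoring error recovery; the paths and function name are illustrative:

#include <unistd.h>
#include <sys/swap.h>

/* Swap to a freshly laid-out file, then retire the fragmented one. */
static int replace_swapfile(const char *old_path, const char *new_path)
{
        if (swapon(new_path, 0) < 0)    /* start swapping to the new file */
                return -1;
        if (swapoff(old_path) < 0)      /* migrate pages off the old one */
                return -1;
        return unlink(old_path);        /* then delete it */
}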
defragmentation of swap file?
Posted Feb 5, 2009 12:43 UTC (Thu) by jengelh (subscriber, #33263)

> It would be fine by me if online defragmentation of the swap file wasn't allowed.

This, in fact, also holds true for FAT and NTFS. (It seems possible to get it moved by removing the system attribute from it, but who knows what side effects this has on a live swap...)
defragmentation of swap file?
Posted Feb 5, 2009 22:09 UTC (Thu) by yipyip (subscriber, #25108)

Indeed, and there's a utility for Windows called PageDefrag that will defragment the Windows pagefile and other critical system files early on boot, before the system is up and using them.
Online defragmentation for ext4
Posted Feb 5, 2009 3:40 UTC (Thu) by dtlin (subscriber, #36537)

XFS already has working online defragmentation, via xfs_fsr and ioctl(XFS_IOC_SWAPEXT). Has anybody mentioned why online defragmentation for ext4 can't just use the same interfaces?
Online defragmentation for ext4
Posted Feb 5, 2009 5:53 UTC (Thu) by dgc (subscriber, #6611)

The algorithm for creating a new file, preallocating, and comparing extents before doing any data movement is pretty much the same as what xfs_fsr does. However, they differ in the method of data exchange: xfs_fsr does all the data movement via direct IO in user space. This was considered a fundamental blocker for ext4.
the need isn't limited to rotating media
Posted Feb 5, 2009 7:17 UTC (Thu) by dlang (subscriber, #313)

On flash disks the access time differences are smaller, but there are still advantages to sequential reads/writes compared to random ones.

In addition, flash devices actually write in blocks much larger than the typical filesystem block size (block sizes are in the 1-4K range, eraseblocks are in the 128K to multi-megabyte range). A worst-case scenario with current low-end systems would be overwriting an existing file on a filesystem with 1K blocks and 256K eraseblocks: if you have a 1M file, it could be 4 writes to the flash, or 1000.

Normally files don't get this fragmented, so you won't see that worst case in real life, but the grouping can really matter.
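Roughly the arithmetic behind those numbers (the "1000" above is this 1024, rounded):

#include <stdio.h>

int main(void)
{
        const unsigned file  = 1024 * 1024;     /* 1M file */
        const unsigned block = 1024;            /* 1K filesystem blocks */
        const unsigned erase = 256 * 1024;      /* 256K eraseblocks */

        /* Contiguous file: one write per eraseblock spanned. */
        printf("contiguous: %u eraseblock writes\n", file / erase);

        /* Fully fragmented: each 1K block can land in a different
         * eraseblock, forcing a separate eraseblock write for each. */
        printf("fragmented: up to %u eraseblock writes\n", file / block);
        return 0;
}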
Another place where you could see a huge difference is on RAID 5/6 arrays. If your system checks parity on reads, a read of one block requires reading the entire stripe and verifying the parity data. If all your data is on that stripe you have it all; if not, you have to repeat the process for the other stripes.

For writes it can be even more drastic. To modify an existing stripe you need to read the existing block from the disk being changed and from the parity disk(s), then calculate the changes to the parity and write the changes out. If you know that you are overwriting the entire stripe, there is no need to read any old data first; just calculate the new parity and do the write. If you could do this (no system does this today), then streaming/large-file writes to RAID 5/6 would be just about as fast as to RAID 0 (some additional CPU time, and the IO bandwidth of the parity disks, but unless the system is saturated these will not matter).
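The saving comes straight from the parity arithmetic. A sketch, using the XOR parity of RAID 5; the chunk size and disk count are illustrative:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CHUNK      4096  /* bytes per disk per stripe (illustrative) */
#define DATA_DISKS 4     /* data disks in the array (illustrative) */

/* Full-stripe write: parity is computed from the new data alone,
 * so nothing has to be read from the disks first. */
static void parity_full_stripe(const uint8_t data[DATA_DISKS][CHUNK],
                               uint8_t parity[CHUNK])
{
        memset(parity, 0, CHUNK);
        for (int d = 0; d < DATA_DISKS; d++)
                for (size_t i = 0; i < CHUNK; i++)
                        parity[i] ^= data[d][i];
}

/* Partial write of one chunk: the old data and old parity must first
 * be read back from the disks, because
 * new_parity = old_parity ^ old_data ^ new_data. */
static void parity_rmw(const uint8_t old_data[CHUNK],
                       const uint8_t new_data[CHUNK],
                       uint8_t old_parity[CHUNK])
{
        for (size_t i = 0; i < CHUNK; i++)
                old_parity[i] ^= old_data[i] ^ new_data[i];
}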
Online defragmentation for ext4
Posted Feb 5, 2009 11:13 UTC (Thu) by mp (subscriber, #5615)

If I understand correctly, there is also one other reason, not mentioned clearly in the article, why the defragmentation daemon needs some support from kernel space: the "put the newly defragmented version in the place of the old one" step must be done by updating the inode of the original file, or you would end up having to hunt down and update all the directory entries pointing to it.
Online defragmentation for ext4
Posted Feb 5, 2009 16:58 UTC (Thu) by nix (subscriber, #2304)

You'd also break everything that had the file open for writing if you left it pointing at a deleted file.
Online defragmentation for ext4
Posted Feb 5, 2009 13:44 UTC (Thu) by brother_rat (subscriber, #1895)

This doesn't seem to help consolidate free space, which I'd have thought is a big part of limiting future fragmentation, and which would also make it easier to defragment existing large files.