What ever happened to chunkfs?
(http://lwn.net/Articles/337680/)

Background

Before we describe chunkfs, we should first talk about the problems that led to its creation. Chunkfs was motivated by the growing difficulty of reading and processing all the data necessary to check an entire file system. As the capacity of storage grows, the capacity of the rest of the system to read and check that data is not keeping pace. As a result, the time to check a "normal" file system grows, rather than staying constant as it would if the memory, bandwidth, seek time, etc. of the system grew in proportion to its storage capacity. These differential rates of growth in hardware - like the differences between RAM capacity, bandwidth, and latency - are part of what keeps systems research interesting.

To understand the change in time to check and repair a file system, a useful analogy is to compare the growth of my library (yes, some people still own books) with the growth of my ability to read, organize, and find the books in it. As a child, my books all fit on one shelf. I could find the book I wanted in a matter of seconds and read every single one of my books in a week. As an adult, my books take up several bookshelves (in addition to collecting on any flat surface). I can read many times faster than I could when I was a kid, and organize my books better, but finding a book can take several minutes, and reading them all would take me a year or more. If I were trying to find a twenty-dollar bill I left in a book, it would take me several hours to riffle through all the books to find it. Even though I am a better reader now, my library grew faster than my ability to read or search it. Similarly, computers are faster than ever, but storage capacity grew even faster.

The consequence of this phenomenon that first caught my attention was the enormous disparity between the seek time and the capacity of disks. I calculated a projected change in fsck time given these predictions [PDF] from a storage industry giant in 2006.
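To make the scaling argument concrete, here is a rough back-of-envelope model of fsck time - a sketch only, and the capacities, bandwidths, and seek rates below are illustrative assumptions, not the industry predictions cited above. The point is simply that fsck time is roughly a sequential scan of the metadata plus a pile of random seeks, and both terms grow with capacity while bandwidth and seek time improve far more slowly.

    # Back-of-envelope fsck time model. All hardware numbers below are
    # illustrative assumptions, not the 2006 industry predictions cited above.

    def fsck_time(capacity_gb, metadata_fraction, bandwidth_mb_s,
                  seeks_per_gb, seek_ms):
        """Rough fsck time in seconds: sequential metadata scan plus random seeks."""
        scan = (capacity_gb * 1024 * metadata_fraction) / bandwidth_mb_s
        seeks = capacity_gb * seeks_per_gb * (seek_ms / 1000.0)
        return scan + seeks

    # A hypothetical older disk: 500GB, 60MB/s, 8ms average seek.
    print(fsck_time(500, 0.01, 60, 200, 8.0) / 60)    # roughly 15 minutes
    # A hypothetical newer disk: 8x the capacity, but only 2x the bandwidth
    # and almost no improvement in seek time.
    print(fsck_time(4000, 0.01, 120, 200, 7.0) / 60)  # roughly 100 minutes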
Since then, commodity flash-based storage has finally become a practical reality. Solid-state disks (SSDs) have much faster "seeks", somewhat improved bandwidth, and drastically reduced capacity compared to disks, so the performance of fsck ought to be extremely good. (I am unable to find any measurements of fsck time on an SSD, but I would estimate it to be on the order of seconds. Readers?) However, SSDs don't solve all our problems.

First, SSDs are an element in the cache hierarchy of a system, layered between system RAM and disk, not a complete replacement for disks. The sweet spot for SSDs will continue to expand, but it will be many years or decades before disks are completely eliminated - look at the history of tape. It's easy for a laptop user to wave their hands and say, "Disks are obsolete," but try telling that to Google, your mail server sysadmin, or even your MythTV box.

Second, remember that the whole reason we care about fsck in the first place is that file systems get corrupted. One important source of file system corruption is failures in the storage hardware itself. Ask any SSD manufacturer about the frequency of corruption on their hardware and they will inundate you with equations, lab measurements, and simulations showing that the flash in their SSDs will never wear out in "normal" use - but they also won't reveal the details of their wear-leveling algorithms or the workloads they used to make their predictions. As a result, we get surprises in real-world use, both in performance and in reliability. When it comes to failure rates, we don't have good statistics yet, so I have to fall back on my experience and that of my kernel hacker friends. What we see in practice is that SSDs corrupt more often and more quickly than disks, and prefer to devour the tastiest, most crucial parts of the file system. You can trust the hardware manufacturers when they wave their hands and tell you not to worry your pretty little head, but I put more weight on the money I've made consulting for people with corrupted SSDs.

So, if corruption is a concern, an important benefit of the chunkfs approach is that file system corruption is usually limited to a small part of the file system, regardless of the source of the corruption. Several other repair-driven file system principles - such as duplicating and checksumming important data - make recovery of data after corruption much more likely. So even if you don't care about fsck time because you'll be using an SSD for the rest of your life, you might still read on because you care about getting your data back from your SSD.

The conclusion I came to is that file systems should be designed with fast, reliable check and repair as a goal, from the ground up. I outline some useful methods for achieving this goal in a short paper: Repair-driven File System Design [PDF].

Chunkfs design

Chunkfs is the unmusical name for a file system architecture designed under the assumption that the file system will be corrupted at some point. Therefore the on-disk format is optimized not only for run-time performance but also for fast, reliable file system check and repair. The basic concept behind chunkfs is that a single logical file system is constructed out of multiple individual file systems (chunks), each of which can be checked and repaired (fscked) individually.
This is great and all, but now we have a hundred little file systems, with the namespace and disk space fragmented amongst them - hardly an improvement. For example, if we have a 100GB file system divided into 100 1GB chunks, and we want to create a 2GB file, it won't fit in a single chunk - it has to somehow be spread across multiple chunks. So how do we glue all the chunks back together again so we can share the namespace and disk space while preserving the ability to check and repair each chunk individually? What we want is a way to connect a file or directory in one chunk with a file or directory in another chunk in such a way that the connection can be quickly and easily checked and repaired without a full fsck of the rest of the chunk. The solution is something we named a "continuation inode". A continuation inode "continues" a file or directory into another chunk.
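Before getting into how these links are checked, here is a minimal sketch of the information a continuation inode has to carry. The field names are invented for illustration - this is not the prototype's on-disk format - but the essential content is the pair of inode references, one in each direction, plus the range of the file that lives in this chunk.

    # Sketch of a continuation inode, with invented field names (the real
    # prototypes stored this metadata differently, e.g. in extended attributes).

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class InodeRef:
        chunk_id: int       # which chunk the inode lives in
        inode_num: int      # inode number within that chunk

    @dataclass
    class ContinuationInode:
        here: InodeRef      # this continuation inode's own location
        origin: InodeRef    # back pointer to the inode being continued
        offset: int         # byte offset within the file of the data held here
        length: int         # how many bytes of the file live in this chunk

    @dataclass
    class Inode:
        here: InodeRef
        continuation: Optional[InodeRef] = None   # forward pointer, if any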
A continuation inode and its original inode are linked by a pair of pointers: one pointing from the original inode to the continuation inode, and another pointing back from the continuation inode to the original inode. When checking the second chunk, you can check the validity of the continuation inode quickly, by using the back pointer to look up the original inode in its chunk. This is a check of a "cross-chunk reference" - any file system metadata that makes a connection between data in two different chunks. Cross-chunk references must always have forward and back pointers so that they can be verified starting from either chunk.

Cross-chunk references must satisfy one other requirement: you must be able to quickly find all cross-references from chunk A to an arbitrary chunk B, without searching or checking the entire chunk A. To see why this must be so, consider an example. Chunk A has an inode X which is continued into chunk B. What if chunk A were corrupted in such a manner that inode X was lost entirely? We would finish checking and repairing chunk A, and it would be internally self-consistent, but chunk B would still have a continuation inode for X that is now impossible to look up or delete - an orphan continuation inode. So, as the second step in checking chunk A, we must find all references to chunk A from chunk B (and all the other chunks) and check that they are in agreement. If we couldn't do that quickly, we would have to search every other chunk in the file system to check a single chunk - and that's not quick.

A chunkfs-style file system can be checked and repaired incrementally and online, whereas most file systems must be checked all at once while the file system is offline. Some exceptions to this general rule do exist, such as the BSD FFS snapshot-based online check and an online self-healing mechanism for NTFS, but in general these facilities are hacked on after the fact and are severely limited in scope. For example, if the BSD online check actually finds a problem, the file system must be unmounted and repaired offline, all at once, in the usual way, including a rerun of the checking stage of fsck. The chunkfs design requires solutions to several knotty problems we don't have room to cover in this article, but you can read about them in our 2006 Hot Topics in Dependability paper: Chunkfs: Using divide-and-conquer to improve file system reliability and repair [PDF].
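Putting the two requirements together - pointer pairs that can be verified from either end, and a per-chunk index of outgoing references keyed by target chunk - the cross-chunk pass for a repaired chunk looks roughly like the following. This is a sketch over the invented structures above, not the prototype code; the outgoing_refs index and lookup() method are assumptions for illustration.

    # Sketch of the cross-chunk pass after chunk "damaged_id" has been
    # repaired internally. "chunks" is assumed to map chunk IDs to objects
    # with an outgoing_refs index ({target_chunk_id: [ContinuationInode]})
    # and a lookup(inode_num) method; both are invented for illustration.

    def pointers_agree(origin_inode, cont):
        # A cross-chunk link is valid only if both ends point at each other.
        return (origin_inode.continuation == cont.here and
                cont.origin == origin_inode.here)

    def find_orphans(damaged_id, chunks):
        orphans = []
        for chunk_id, chunk in chunks.items():
            if chunk_id == damaged_id:
                continue
            # Only the references aimed at the damaged chunk are visited;
            # no other chunk is scanned in full.
            for cont in chunk.outgoing_refs.get(damaged_id, []):
                origin = chunks[damaged_id].lookup(cont.origin.inode_num)
                if origin is None or not pointers_agree(origin, cont):
                    orphans.append(cont)
        return orphans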
Measurements

Repair-driven file system design, and chunkfs in particular, sounds like a neat idea, but as a file system developer, I've had a lot of neat ideas that turned out to be impossible to implement, or had too much overhead to be practical. The questions I needed to answer about chunkfs were: Can it be done? If it can be done, will the overhead of continuation inodes outweigh the benefit? In particular, we need to balance the time it takes to check an individual chunk against the time it takes to check its connections to other chunks (making sure that the forward and back pointers of the continuation inodes agree with their partners in the other chunk).

First, let's take a closer look at file system check time on chunkfs. The time to check and repair a file system with one damaged chunk is the sum of two components: the time to check that chunk internally and the time to check the cross-references to and from it.

    T_fs = T_chunk + T_cross

The time to check one chunk is a function of the size of the chunk, and the size of the chunk is the total size of the file system divided by the number of chunks.

    T_chunk = f(size_chunk)
    size_chunk = size_fs / n_chunks

The time to check the cross-chunk references to and from this chunk depends on the number of those cross-references. The exact number will vary, but in general larger chunks will have fewer cross-references and smaller chunks will have more - that is, cross-chunk check time grows as the number of chunks grows.

    T_cross = f(n_chunks)

So, per-chunk check time gets smaller as we divide the file system into more chunks, while at the same time the cross-chunk check time grows larger. Additionally, the extra disk space taken up by continuation inodes grows as the number of chunks grows, as does the overhead of looking up and following continuation inodes during normal operation. We want to find a sweet spot where the sum of the time to check a chunk and its cross-references is minimized, while keeping the runtime overhead of the continuation inodes small. Does such a sweet spot exist?

The answer depends on the layout of files and directories on disk in real life. If cross-chunk references are extremely common, then the overhead of continuation inodes will outweigh the benefits. We came up with an easy way to estimate the number of cross-chunk references in a chunkfs file system: take a "real", in-use ext3 file system and, for each file, measure the number of block groups containing data from that file. Then, for each directory, measure the number of block groups containing files from that directory. If we add up the number of block groups, less one, for each file and directory, we get the number of cross-chunk references in a similar chunkfs file system with chunks the size of the ext3 file system's block groups (details here).

Karuna Sagar wrote a tool to measure these cross-references, called cref, and I added some code to do a worst-case estimate of the time to check the cross-references (assuming one seek for every cross-reference). The results were encouraging: assuming disk hardware progresses as predicted, the average cross-chunk reference checking time would be about 5 seconds in 2013, and the worst case would be about 160 seconds (just under three minutes). This is with a 1GB chunk size, so the time to check the chunk itself would be a few seconds. This estimate is worst-case in another way: the ext3 allocator is in no way optimized to reduce cross-chunk references. A chunkfs-style file system would have a more suitable allocator.
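The estimate itself is just counting. The following sketch restates the arithmetic; it is not Karuna Sagar's cref tool, and the input format (a mapping from each file or directory to the set of block groups it touches) is an assumption for illustration.

    # Restating the cross-reference estimate described above: each file or
    # directory contributes one cross-reference per block group beyond the
    # first one that holds its data (or, for a directory, its files).

    def estimate_cross_references(block_groups_per_object):
        return sum(max(len(groups) - 1, 0)
                   for groups in block_groups_per_object.values())

    # Hypothetical toy input:
    example = {
        "/home/user/big.iso":  {3, 4, 5},   # spans three block groups -> 2 refs
        "/home/user/note.txt": {3},         # fits in one group        -> 0 refs
        "/home/user":          {3, 4},      # its files span two groups -> 1 ref
    }
    print(estimate_cross_references(example))   # prints 3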
Implementations

Chunkfs was prototyped three times. The first prototype, written by Amit Gud for his master's thesis [PDF] in 2006-2007, implemented chunkfs as modifications to the ext2 driver in FUSE. Note that the design of chunkfs described in the thesis is out of date in some respects; see the chunkfs paper [PDF] for the most recent version. He also ported this implementation to the kernel in mid-2007, just for kicks. The results were encouraging. Our main concern was that continuation inodes would proliferate wildly, overwhelming any benefits. Instead, files with continuation inodes were uncommon - 2.5% of all files in the file systems - and no individual file had more than 10 continuation inodes. The test workload included some simulation of aging by periodically deleting files while filling the test file system. (It would be interesting to see the results from using Impressions [PDF], a very sophisticated tool for generating realistic file system images.)

These implementations were good first steps, but they were based on an earlier version of the chunkfs design, before we had solved some important problems. In these implementations, any chunk that had been written to since the file system was mounted had to be fscked after a crash. Given that most of the file system is idle in common usage, this reduced check time by about 2/3 in the test cases, but we are looking for a factor of 100, not a factor of 3. They also lacked a quick way to locate all of the references into a particular damaged chunk, so they only checked the references leading out of the chunk. They used an old version of the solution for hard links, which would allow hard links to fail if the target's chunk ran out of space, instead of growing a continuation for the target into a chunk that did have space. In short, these prototypes were a good start, but lacked the key feature: drastically reduced file system repair time.

In mid-2007, I decided to write a prototype of chunkfs as a layered file system, similar to unionfs or ecryptfs, that was completely independent of the client file system used in each chunk. Continuation inodes are implemented as regular files, using extended attributes to store the continuation-related metadata (the forward and back pointers and the offset and length of the file data stored in this continuation inode). When the file data exceeds an arbitrary size (40K in my prototype), a continuation inode is allocated in another chunk and the data after that point in the file is stored in that inode's data. All of the continuation inodes emanating from a particular chunk are kept in one directory so that they can be quickly scanned during the cross-chunk pass of fsck.

To test that the prototype could recover from file system corruption in one chunk without checking the entire file system, I implemented fsck as a simple shell script. First, it fscks the damaged chunk by running the ext2 fsck on it. This may end up deleting or moving arbitrary files in the chunk, which could make it out of sync with the other chunks. Second, it mounts the now-repaired file systems and reads their /chunks directories to find all connections to the damaged chunk and consistency-check them. If it finds an orphaned continuation - a continuation whose origin inode in the damaged chunk was destroyed or lost - then it deletes that continuation inode.
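Restated over the invented structures from the earlier sketches (the real prototype used a shell script driving the ordinary ext2 fsck), the repair flow is short:

    # Sketch of the repair flow just described, not the prototype's shell
    # script. fsck_one_chunk stands in for running the ordinary per-chunk
    # checker (e2fsck in the prototype); delete_continuation is an invented
    # helper that removes an orphaned continuation from its chunk.

    def repair(damaged_id, chunks, fsck_one_chunk, delete_continuation):
        # Step 1: internal check and repair of the damaged chunk only.
        fsck_one_chunk(chunks[damaged_id])

        # Step 2: cross-chunk pass. Visit only the continuations that point
        # into the damaged chunk and drop those whose origin did not survive.
        for orphan in find_orphans(damaged_id, chunks):
            delete_continuation(chunks[orphan.here.chunk_id], orphan)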
The fsck script is deficient in one particular respect: it checks all the chunks, because I didn't write the code to mark a chunk as damaged. This is a difficult problem in general; sometimes we know that a chunk has been damaged because the disk gave us an IO error, or because we found an inconsistency while using the file system, and then we mark the file system as corrupted and needing fsck. But plenty of corruption is silent - how can we figure out which chunk was silently corrupted? We can't, but we can quietly "scrub" chunks by fscking them in the background. Currently, ext2/3/4 triggers a paranoia fsck every N mounts or M days since the last check. Unfortunately, this introduces an unexpected delay of minutes or hours at boot time; if you're lucky, you can go have a cup of coffee while it finishes; if you're unlucky, you'll be figuring out how to disable it while everyone who has come to your talk about improvements in Linux usability watches you. (N.B.: Reboot and add "fastboot" to the kernel command line.) With chunkfs, we could run fsck on a few chunks at every boot, adding a few seconds to every boot but avoiding occasional long delays. We could also fsck inactive chunks in the background while the file system is in use.

Results

I gave a talk at LCA in January 2008 about chunkfs, which included a live demo of the final chunkfs prototype in action: creating files, continuing them into the next chunk, deliberately damaging a chunk, and repairing the file system. Because I carefully prepared a typescript-based fallback in case things went sideways, the demo went perfectly. A video of the talk [Ogg Theora] is available. Note that I have never been able to bring myself to watch this talk, so I don't know if it contains any embarrassing mistakes. If you want, you can follow along during the demo section with the annotated output from the demo.

The collection-of-file-systems approach ended up being more complex than I had hoped. The prototype used ext2 as the per-chunk file system - a reasonable choice for some throw-away code, but not workable in production. However, once you switch to any reliable file system, you end up with one journal per chunk, which is quite a lot of overhead. In addition, it seemed likely we'd need another journal to recover from crashes during cross-chunk operations. Another source of complexity was the use of a layered file system approach - basically, the chunkfs layer pretends to be a client file system when it's talking to the VFS, and pretends to be the VFS when it's talking to the per-chunk file system. Since none of this is officially supported, we end up with a long list of hacks and workarounds, and it's easy to run into surprise reference-counting or locking bugs. Overall, the approach worked for a prototype, but it didn't seem worth the investment it would take to make it production quality, especially with the advent of btrfs.

Future work

When I began working on chunkfs, the future of Linux file systems development looked bleak. The chunkfs in-kernel prototype looked like it might be the only practical way to get repair-driven design into a Linux file system. All that has changed; file system development is a popular, well-funded activity and Linux has a promising next-generation file system, btrfs, which implements many principles of repair-driven file system design, including checksums, magic numbers, and redundant metadata. Chunkfs has served its purpose as a demonstration of the power of the repair-driven design approach, and I have no further development plans for it.

Conclusions

The three chunkfs prototypes and our estimates of cross-chunk references using real-world file systems showed that the chunkfs architecture works as advertised. The prototypes also convinced us that it would be difficult to retrofit existing journaling file systems to the chunkfs architecture. Features that make file system check and repair fast and reliable work best when designed into the architecture of the file system from the beginning. Btrfs is an example of a file system designed from the ground up with the goal of fast, reliable check and repair.

Credits

Many people and organizations generously contributed to the design and prototyping of chunkfs. Arjan van de Ven was the co-inventor of the original chunkfs architecture. Theodore Ts'o and Zach Brown gave invaluable advice and criticism while we were thrashing out the details. The participants of the Hot Topics in Dependability Workshop gave us valuable feedback and encouragement. Karuna Sagar wrote the cross-reference measuring tool that gave us the confidence to go forward with an implementation. Amit Gud wrote two prototypes while a graduate student at Kansas State University.
The development of the third prototype was funded at various points by Intel, EMC, and VAH Consulting. Finally, too many people to list made valuable contributions during discussions about chunkfs on mailing lists, and I greatly appreciate their help.
Comments

Maybe btrfs has no fsck
Posted Jun 17, 2009 13:41 UTC (Wed) by qu1j0t3 (subscriber, #25786)

But there are several other mature filesystems which avoid it on an unexpected shutdown - reiser3fs, the just-released NILFS2, and of course the magnum opus, ZFS.
Maybe btrfs has no fsck
Posted Jun 17, 2009 14:21 UTC (Wed) by anselm (subscriber, #2796)

Not having to do an fsck after an unclean shutdown is not the same as not being able to do an fsck at all even if one wanted to. File systems can become corrupted for various reasons other than sudden system crashes. Maybe it's just me, but I don't think I'd be happy with a file system for which there is no reasonable fsck program.
Maybe btrfs has no fsck
Posted Jun 17, 2009 17:54 UTC (Wed) by drag (subscriber, #31333)

Yes. Reiserfs needs a fsck... whether or not it provides one is kinda irrelevant. I think that a fsck for reiserfs is provided, but I am not sure. Maybe only for later versions. Never used it much myself.

One of the strengths of Ext3 over XFS and Reiserfs is its fsck. The journalling features of XFS and Reiserfs only protect the filesystem (aka metadata) from corruption; they do not help protect your actual data or detect problems with your data. For that you need Ext3's fsck.
Maybe btrfs has no fsck
Posted Jun 17, 2009 21:36 UTC (Wed) by anselm (subscriber, #2796)

What I've heard about the Reiserfs fsck is that it will, among other issues, mistake a file system superblock in the middle of a partition for the start of a whole new file system and get thoroughly confused. With virtualisation in widespread use and people keeping file system images in files, this is more of a problem than it used to be when Reiserfs was new. If this is actually true, it is one more reason to bury Reiserfs deeply, with a wooden stake driven through its heart.
What ever happened to chunkfs?
Posted Jun 17, 2009 16:17 UTC (Wed) by philipstorry (subscriber, #45926)

An excellent article - well written and on a very interesting topic.

I also think it's worth thanking Valerie for stopping development. It can be a difficult decision to make, and takes bravery and good judgement. Chunkfs, especially as a VFS, would likely have become one huge furball of hacks. To accept that it had served its purpose and that time could be better spent elsewhere looks like the right thing.

Now all I have to do is wait for btrfs to bring me all of this. I have resorted to manual mirroring via scheduled rsyncs, which balances all the pros and cons nicely for me. I'd never stop doing mirroring in some form, but to be able to trust hardware (or any lower level that includes the fs) mirroring would be nice for fileservers.