March 1st, 2009 in Filesystems, Linux, SSD

On occasion, you will see the advice that the ext3 file system is not suitable for Solid State Disks (SSD’s) due to the extra writes caused by journaling — and so Linux users using SSD’s should use ext2 instead. However, is this folk wisdom actually true? This weekend, I decided to measure exactly what the write overhead of journaling actually is in actual practice.

For this experiment I used ext4, since I recently added a feature to track the amount of writes to the file system over its lifetime (to better gauge the wear and tear on an SSD). Ext4 also has the advantage that (starting in 2.6.29), it can support operations with and without a journal, allowing me to do a controlled experiment where I could manipulate only that one variable. The test workload I chose was a simple one:

Clone a git repository containing a linux source tree
Compile the linux source tree using make -j2
Remove the object files by running make clean

For the first test, I ran the test using no special mount options, and the only difference being the presence or absence of the has_journal feature. (That is, the first file system was created using mke2fs -t ext4 /dev/closure/testext4, while the second file system was created using mke2fs -t ext4 -O ^has_journal /dev/closure/testext4.)

Amount of data written (in megabytes) on an ext4 filesystem
Operation	with journal	w/o journal	percent change
git clone	367.7	353.0	4.00%
make	231.1	203.4	12.0%
make clean	14.6	7.7	47.3%

What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to their final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller: 4% for the git clone, and 12% for the actual kernel compile.

The noatime mount option

Can we do better? Yes, if we mount the file system using the noatime mount option:

Amount of data written (in megabytes) on an ext4 filesystem mounted with noatime
Operation	with journal	w/o journal	percent change
git clone	367.0	353.0	3.81%
make	207.6	199.4	3.95%
make clean	6.45	3.73	42.17%

This reduces the extra cost of the journal in the git clone and make steps to be just under 4%. What this shows is that most of the extra meta-data cost without the noatime mount option was caused by update to the last update time for kernel source files and directories.

The relatime mount option

There is a newer alternative to the noatime mount option, relatime. The relatime mount option updates the last access time of a file only if the last modified or last inode changed time is newer than the last accessed time. This allows programs to be able to determine whether a file has been read size it was last modified. The usual (actually, only) example that is given of such an application is the mutt mail-reader, which uses the last accessed time to determine if new mail has been delivered to Unix mail spool files. Unfortunately, relatime is not free. As you can see below, it has roughly double the overhead of noatime (but roughly half the overhead of using the standard Posix atime semantics):

Amount of data written (in megabytes) on an ext4 filesystem mounted with relatime
Operation	with journal	w/o journal	percent change
git clone	366.6	353.0	3.71%
make	216.8	203.7	6.04%
make clean	13.34	6.97	45.75%

Personally, I don’t think relatime is worth it. There are other ways of working around the issue with mutt — for example, you can use Maildir-style mailboxes, or you can use mutt’s check_mbox_size option. If the goal is to reduce unnecessary disk writes, I would mount my file systems using noatime, and use other workarounds as necessary. Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don’t want noatime semantics, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by setting running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will have the noatime file inherited.

Comparing ext3 and ext2 filesystems

Amount of data written (in megabytes) on an ext3 and ext2 filesystem
Operation	ext3	ext2	percent change
git clone	374.6	357.2	4.64%
make	230.9	204.4	11.48%
make clean	14.56	6.54	55.08%

Finally, just to round things out, I tried the same experiment using the ext3 and ext2 file systems. The difference between these results and the ones involving ext4 are the result of the fact that ext2 does not have the directory index feature (aka htree support), and both ext2 and ext3 do not have extents support, but rather use the less efficient indirect block scheme. The ext2 and ext3 allocators are also someone different from each other, and from ext4. Still, the results are substantially similar with the first set of Posix-compliant atime update numbers (I didn’t bother to do noatime and relatime benchmark runs with ext2 and ext3, but I expect the results would be similar.)

Conclusion

So given all of this, where did the common folk wisdom that ext3 was not suitable for SSD’s come from? Some of it may have been from people worrying too much about extreme workloads such as “make clean”; but while doubling the write load sounds bad, going from 4MB to 7MB worth of writes isn’t that much compared to the write load of actually doing the kernel compile or populating the kernel source tree. No, the problem was that first generation SSD’s had a very bad problem with what has been called the “write amplification effect”, where a 4k write might cause a 128k region of the SSD to be erased and rewritten. In addition in order to provide safety against system crashes, ext3 has more synchronous write operations — that is where ext3 waits for the write operation to be complete before moving on, and this caused a very pronounced and noticeable stuttering effect which was fairly annoying to users. However, the next generation of SSD’s, such as Intel’s X25-M SSD, have worked around the write amplification affect.

What else have we learned? First of all, for normal workloads that include data writes, the overhead from journaling is actually relatively small (between 4 and 12%, depending on the workload). Further, than much of this overhead can be reduced by enabling the noatime option, with relatime providing some benefit, but ultimately if the goal is to reduce your file system’s write load, especially where an SSD is involved, I would strongly recommend the use of noatime over relatime.

Share and Enjoy:

Related posts (automatically generated):

Should Filesystems Be Optimized for SSD’s? In one of the comments to my last blog entry, an anonymous commenter writes: You seem to be taking a...
Aligning filesystems to an SSD’s erase block size I recently purchased a new toy, an Intel X25-M SSD, and when I was setting it up initially, I decided...
Fast ext4 fsck times This wasn’t one of the things we were explicitly engineering for when were designing the features that would go into...
Fast ext4 fsck times, revisited Last night I managed to finish up a rather satisfying improvement to ext4’s inode and block allocators. The ext4’s original...

Trackback URI | Comments RSS

16 Responses to “ SSD’s, Journaling, and noatime/relatime ”

# 1 Norman Yarvin Says:
March 2nd, 2009 at 12:29 am
I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal), which might blow out primitive wear-leveling schemes and result in those blocks becoming unreliable. (I don’t know quite where I got this notion from, though, and it certainly wasn’t anywhere authoritative.)
# 2 tytso Says:
March 2nd, 2009 at 1:15 am
@1: I’d thought that the reason to avoid ext3 on SSDs, at least most of the ones available today, was not the total number of writes but rather the repeated writes to the same place on the disk (that is, the journal)

Norman,

Actually, even the most primitive SSD’s and Flash drives have to get this right, because the Windows FAT filesystem are constantly updating the same locations on disk (namely for the File Allocation Table), which is in a fixed location on disk. So although there’s not a lot we can count on in terms of the quality of flash drives’ wear level, it’s very likely they get that right, since otherwise their reliability on basic FAT filesystems, which are used in essentially every single digital camera on the market, would be pretty bad.
# 3 Bergwolf Says:
March 2nd, 2009 at 2:56 am
Is it possible for ext4 to add an allocator that keeps track of the last write place and (when possible) allocate blocks from there on for SSD? btrfs is doing something similar for there are benchmarks showing that SSDs have better performance on sequential writes. But btrfs is still years away. It would be useful for ext4 to add some optimization for SSD.
# 4 Solid State Devices (SSD) and journaling | Beyond Abstraction Says:
March 2nd, 2009 at 3:12 am
[...] SSD Write Amplification: href="" class="moz-txt-link-freetext" href="http://www.extremetech.com/article2/0,2845,2329594,00.asp”">http://www.extremetech.com/article2/0,2845,2329594,00.asp”>http://www.extremetech.com/article2/0,2845,2329594,00.asp [...]
# 5 Georgi Chulkov Says:
March 2nd, 2009 at 5:30 am
Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount? I would love to be able to mark my USB stick this way, so that no matter where I plug it in, it will use noatime.
# 6 tytso Says:
March 2nd, 2009 at 9:31 am
@5: Is there a way to specify that a filesystem should always be mounted with noatime, even if the option is not given on mount?

There isn’t a way to do this as a mount option, but the easist thing to do is to set the noatime flag on the file system’s root directory when it’s freshly created, or to set the noatime flag for all files and directories, using the chattr command: “chattr -R +A /mntpt“, where you should replace /mntpt with the mount point of your thumbdrive.

The reason why this works is because the noatime flag is inherited, so all new files and directories created in a directory that has the noatime flag set will also have the noatime flag set. And if all of the files and directories in use on the file system has the noatime flag set, it’s functionally equivalent to mounting the filesystem with the noatime mount option.
# 7 TK Says:
March 2nd, 2009 at 12:43 pm
Thanks for “chattr” tip. I used it for my directories and it printed “Operation not supported while reading flags on …” for every symlink. Maybe that output should be only printed when in verbose mode?
# 8 операционные системы Linux/BSD » Оценка производительности файловых систем ext3 и ext4 на SSD накопителе Says:
March 2nd, 2009 at 3:10 pm
[...] “SSD’s, Journaling, and noatime/relatime” - сравнение производительности ФС ext3 и ext4 на SSD накопителе, и оценка влияния наличия журналирования и монтирования в режимах noatime/relatime. В качестве тестов производилось измерение времени клонирования git дерева и сборки Linux ядра. В тестах ext3 заметно проигрывает ext4: git clone был выполнен в ext4 на 4.64% быстрее, make - на 11.48%, make clean - на 55.08%. [...]
# 9 judicator Says:
March 2nd, 2009 at 5:55 pm
There is a big big mistake made about ssd in this article :

all flash memory (that includes ssd) are limited on how much you access to the memory for writing.

An access does not have direct relation with the units mesured here : the Megabyte.

the relation between those two units is complex and is not linear.

sometimes 10mb can be 1 access and another times it can be 100 access …….

The only thing to compare well is to measure the hits that the system made for writing ……. not how much megas it wrote.

In our case, the relationship take those factors in considerations :
- how the partition is made and how it is working
- how the system manage the partition

basicaly ext2 is better suited than ext3 ……… because ext3 does extra writes for the journal (it is not important how much megas it wrote, it has made a minimum of one writing cycle)

ext4 might be a good exit because I’ve read somewhere that this filesystem will have a new cache option especialy designed for ssd ……… basicaly that will do lesser write cycle because it will acces only once when his cache is near to full.
# 10 Soul_Est Says:
March 4th, 2009 at 2:49 pm
I’ve been this blog for a while hoping to gain some more insight into how to deal with SSDs in general since many Arch Linux users have been looking to increase their everyday performance over using HDDs. I found one thread sometime ago that suggested using reiserfs due to its fragmentation properties and another that suggested adding elevator=noop to the kernel boot parameters in the grub boot menu. Your thoughts have given me much to think about.
# 11 tytso Says:
March 4th, 2009 at 6:10 pm
@9: Judicator,

Actually, if you kept on reading all the way to the conclusion, you would have noted that I talked about the write amplification effect, and how with newer SSD’s, such as the Intel X25-M, this is much less of a factor — it has a write amplification factor averaging around 1.1, with a wear leveling overhead of 1.4, compared to older SSD’s that had a write amplification affect of 20 or more.

I believe you are also incorrect when you say that it’s about is “The only thing to compare well is to measure the hits that the system made for writing”. In fact, ext3 tends to pin writes until the transaction commit timer goes off, at which point the data blocks get flushed out, and then the journal blocks, and finally the metadata blocks. The real issue is that older SSD’s did their wear leveling in 128k erase block chunks, and so if you had writes which are scattered across the disk, a single 4k update in an erase block region caused the entire 128k erase block to be rewritten. The X25-M keeps track of disk block sectors at a smaller granularity than the 128k erase block segments, so the fact that writes are scattered across the disk doesn’t cause a massive write amplification effect.
# 12 tytso Says:
March 4th, 2009 at 6:31 pm
@10: Soul_Est,

I’m not that convinced that elevator=noop is the best idea for SSD’s, since combining writes is critical for SSD’s, and I’m not sure the noop elevator will be sufficiently aggressive at combining write requests. I have a feeling the deadline scheduler may be better choice, but I haven’t had a chance to benchmark it yet.
# 13 Anon Says:
March 6th, 2009 at 1:00 pm
@2:

My experience is rather different. On “generation 0″ SSDs the anecdotal comments say the wear levelling is group cyclic and worth very little. The SSD in my own EeePC has now developed bad parts (and that was using an ext2 filesystem mounted with noatime). Other blogs have commented on this too: (Val Aurora has something here: http://valhenson.livejournal.com/25228.html?thread=108940 and davej warns against putting swap on the EeePC “gen 0″ SSDs here http://kernelslacker.livejournal.com/132087.html (it’s a pity the comments have gone on davej’s blog - they were very good))

Now as it happens on my EeePC I used to use ext3 but have switched to ext2 (remember - gen 0 SSD). The biggest difference was in the latency of writes - this SSD has very slow writes. Booting has become a few seconds faster. With ext3 using firefox 3 (which is fsync happy) causes HUGE and very painful delays. With ext2 this is noticably less (but the stalls are still there just not as long). fsck goes quite quickly with ext2 but for some reason even when everything is OK the periodic startup fsck will force a reboot (which is painful). It’s also interesting to see that Ubuntu will never do a startup fsck on battery even if the FS was not properly unmounted…

I’ve also tried a few different io schedulers. The EeePC Xandros distro ships with a command line option to use deadline. I have also used cfq and noop (noop didn’t seem noticibably better than deadline). My hope is that cfq is worth it due to being able to have IO priorities. I have even twiddled the rotational flag that has appeared in 2.6.29 but sadly I don’t have benchmarks to be able to tell if it made a difference (I’m too worried that benchmarking this machine is going to decay the SSD further).

Incidentally I used to use ext3 on an SD card in the EeePC. Now THAT was interesting… After a period of time (seemingly not more than a few weeks) I would be pretty much assured the filesystem would develop self destroying corruption (where an fsck would go and delete everything in site). Since using just the FAT32 partition on the SD card I have not developed any further corruption (although since I stopped booting from it I have taken to write protecting the card whenever possible).
# 14 tytso Says:
March 6th, 2009 at 2:13 pm
The really big problem is that SSD manufacturers don’t feel obligated to tell you what sort of wear leveling algorithm they are using. There is a huge difference between those drives that do:
- No wear levelling at all — rare, since these drives die very quickly
- Those that do dynamic (or what you call “group cyclic”) wear levelling
- Those that do static wear levelling (this is where blocks that contain data that doesn’t move are also periodically moved around so writes can be distributed to those flash cells)
- Those that do sub-erase block allocation and wear levelling to reduce write amplification effects
The Intel SSD is the only one on the market which does the last, although rumor has it that a competitor will be showing up on the market within the next 30 days that will have similar capabilities. I can tell you that with the Intel X25-M SSD, which I now have installed as the primary disk in my laptop, I don’t see any stuttering and performance has been very agreeably fast. Also note that the fsync() issue in Firefox 3 was fixed by FF 3.0.1 (it may have been fixed in FF 3.0 final; I’m not 100% sure). So if you were seeing the problem, my guess is that your distro picked a pre-release FF 3 and didn’t bother to upgrade to a newer Firefox.

Finally, if you’re seeing filesystem corruption which required fsck to fix stuff that then required a reboot, my guess is there is something really bad going on. My guess is that you mentioned an SD card, and that there may have been some issues with the SD card getting jostled or the contacts not being secure that caused the data corruption. Even the crappy SD cards had wear levelling logic that noticed when a cell started going bad, and would stop using that flash cell. So that may have been more of a mechanical issue causing data loss, not a funamental flash card — but that being said, there are many laptops that have SD cards where I would not use them for regular data use, but only for pulling data off a card used by a digital camera — since that’s what they were probably primarily designed for.

I’ve had other people complain about certain notebooks where the SD card stuck out slightly, and when it was jostled, it would get disconnected and the filesystem would get horribly corrupted since (a) they weren’t using ext3, and (b) the filesystem was being written at the time when the SD card was nudged. About the only thing I can tell them is the response to the old joke, “Doctor, doctor, it hurts when I do that….”
# 15 Anon Says:
March 6th, 2009 at 2:44 pm
@14:
Re wear levelling:
So true. If only they would say! I’d be willing to pay a little more to have something that isn’t going to go bad in less than a year. However that kind of goes against your “even the most primitive SSD” statement that you made earlier. Surely you can’t get more primodial than gen 0 SSD? : ) Now you’ve said it I’m wondring if the stock EeePC SSDs really have no wear levelling at all *shudder*. No that’s too painful to even think about so I’m going to stop that thought there…

I wish I could afford an Intel SSD but I can’t and they don’t fit in EeePCs anyway. You speak of a utopia I cannot reach…

Alas the fsync issue was NOT fixed in Firefox 3.0.1. It was lessened slightly but as soon as sqlite starts writing after you’ve got a few links in your history you will really feel it (I can only suggest using an EeePC with an existing firefox 3 profile and you will see just how bad the SSD write speed is). Just for the record I have Firefox 3.0.6 on this machine and this is with the google bad site thing turned off. If you know where to look you will find that this bug is alive and well - https://bugzilla.mozilla.org/show_bug.cgi?id=442967 .

As for your last point - hehe! Well half of my SD Card filesystem was always ext3 the other half vfat and the vfat was (seemingly) always OK (but it was never the root fs). Do both a and b have to be present for the corruption to manifest or is b alone enough?
# 16 Soul_Est Says:
March 7th, 2009 at 11:56 pm
@12 (tytso)
Thanks for replying. I had no idea whether I read was best for SSD performance since unfortunately, I don’t own a good SSD (or notebook for that matter). I’ll relay what you posted to those in the Arch Linux forum as I believe many Archers need to know this. Thanks again.

[linuxkernelnewbies] SSD’s, Journaling, and noatim e/relatime | Thoughts by Ted

SSD’s, Journaling, and noatime/relatime

The noatime mount option

The relatime mount option

Comparing ext3 and ext2 filesystems

Conclusion

16 Responses to “ SSD’s, Journaling, and noatime/relatime ”

Reply via email to