https://twitter.com/marcan42/status/1494213855387734019

-------------------8<----------- cut here ---------- 8< -------------
Well, this is unfortunate. It turns out Apple's custom NVMe drives are
amazingly fast - if you don't care about data integrity.

If you do, they drop down to HDD performance. Thread.

For a while, we've noticed that random write performance with fsync
(data integrity) on Asahi Linux (and also on Linux on T2 Macs) was
terrible. As in 46 IOPS terrible. That's slower than many modern HDDs.

We thought we were missing something, since this didn't happen on macOS.

As it turns out, macOS cheats. On Linux, fsync() will both flush
writes to the drive, and ask it to flush its write cache to stable
storage.

But on macOS, fsync() only flushes writes to the drive. Instead, they
provide an F_FULLFSYNC operation to do what fsync() does on Linux.

So effectively macOS cheats on benchmarks; fio on macOS does not give
numbers comparable to Linux, and databases and other applications
requiring data integrity on macOS need to special-case it and use
F_FULLFSYNC.
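
SQLite is a concrete example of that special-casing: it exposes the
choice as a pragma, off by default, so durability-sensitive use on
macOS has to opt in explicitly. A minimal illustration:

    import sqlite3

    conn = sqlite3.connect("test.db")
    # On systems that support it, this switches SQLite from plain
    # fsync() to F_FULLFSYNC; it is off by default.
    conn.execute("PRAGMA fullfsync = ON")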

How bad is it if you use F_FULLFSYNC? It's bad.

Single-threaded, simple Python file rewrite test (sketched below):

MacBook Air M1 (macOS):
- flushing: 46 IOPS
- not: 40000 IOPS

x86 iMac + WD SN550 1TB NVMe (Linux):
- flushing: 2000 IOPS
- not: 20000 IOPS

x86 laptop + Samsung SSD 860 EVO 500GB SATA:
- flushing: 143 IOPS
- not: 5000 IOPS

So, effectively, Apple's drive is faster than all the others without
cache flushes, but it is more than 3 times slower than a lowly SATA
SSD at flushing its cache. Even if all you wrote was a couple of
sectors, you pay a huge flush penalty if you do *any* writes.

Here, "flushing" on macOS means F_FULLSYNC and "not" means fsync(); on
Linux both are fsync(), but "not flushing" is measured by telling
Linux that the drive write cache is write-through (which stops it from
issuing cache flushes).
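
For reference, a minimal sketch of that kind of test (not the exact
script; filename, block size, and iteration count are arbitrary):

    import os
    import time
    import fcntl  # exposes F_FULLFSYNC on macOS

    N = 500
    buf = b"x" * 4096

    fd = os.open("testfile", os.O_RDWR | os.O_CREAT, 0o644)
    start = time.monotonic()
    for _ in range(N):
        os.pwrite(fd, buf, 0)  # rewrite the same 4 KiB block
        if hasattr(fcntl, "F_FULLFSYNC"):
            # macOS: actually flush the drive cache to stable storage
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
        else:
            # Linux: fsync() already implies a drive cache flush
            os.fsync(fd)
    elapsed = time.monotonic() - start
    os.close(fd)
    print(f"{N / elapsed:.0f} IOPS")

Swap the F_FULLFSYNC branch for a plain os.fsync(fd) to get the "not
flushing" numbers on macOS.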

Note that the numbers are filesystem-dependent (and encryption makes
things more complicated); e.g. the SATA SSD numbers double on VFAT vs.
my root filesystem (ext4 on LVM on dm-crypt), but the pattern is
clear.

macOS doesn't even seem to try to proactively issue syncs; you can
write a file on macOS, fsync() it, wait 5 seconds, issue a hard reboot
(e.g. via USB-PD command), and the data is gone. That's pretty bad.

Of course, in normal usage, this is basically never an issue on
laptops; given the right software hooks, they should never run out of
power before the OS has a chance to issue a disk flush command. But it
certainly is for desktops. And it's a bit fragile re: panics and such.

Unfortunately, this manifests itself as quite visible issues on Linux.
For example, apt-get on Asahi Linux is noticeably slow. Making fsync()
not really flush on macOS is not fair; lots of portable software is
written to assume fsync() means your data is safe.

Our current thinking is we're going to add a knob to the NVMe driver
to defer flush requests up to a maximum time of e.g. 1 second. That
would ensure that a hard shutdown never loses you more than 1 second
of data, which is better than what macOS can claim right now.
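
The real knob would live in the kernel NVMe driver, but the coalescing
idea itself is simple. A toy user-space sketch (illustrative only; the
class and names are made up):

    import threading

    class DeferredFlusher:
        """Coalesce flush requests; at most max_defer seconds pass
        between a request and the actual flush."""

        def __init__(self, do_flush, max_defer=1.0):
            self.do_flush = do_flush  # callable issuing the real flush
            self.max_defer = max_defer
            self.lock = threading.Lock()
            self.timer = None

        def request_flush(self):
            with self.lock:
                if self.timer is None:  # first request starts the clock
                    self.timer = threading.Timer(self.max_defer, self._fire)
                    self.timer.start()
                # later requests piggyback on the pending flush

        def _fire(self):
            with self.lock:
                self.timer = None
            self.do_flush()  # one flush covers all coalesced requests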

Alas, that's still not quite safe. Not flushing means we cannot
guarantee ordering of writes, which means you could end up with actual
data corruption in e.g. a database, not just data loss. There's no
good way around this other than doing full flushes.
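
To make the ordering hazard concrete, here's a schematic
write-ahead-log commit (illustrative only, not any real database).
Steps 2 and 4 are the barriers that only a full flush provides:

    import os

    def commit(log_fd, data_fd, record, new_data, offset, flush):
        os.write(log_fd, record)              # 1. append intent to log
        flush(log_fd)                         # 2. barrier: log first
        os.pwrite(data_fd, new_data, offset)  # 3. overwrite in place
        flush(data_fd)                        # 4. make commit durable

    # flush is os.fsync on Linux; on macOS it has to be the F_FULLFSYNC
    # fcntl, or the drive may persist step 3 before step 2.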

So the unfortunate conclusion is that if you're e.g. running a
transactional database on Apple hardware, and you need to be able to
survive a hard poweroff without data corruption, you're never going to
get more than ~46 TPS.

Unless Apple improves their ANS firmware to fix this.

And for what it's worth, I inadvertently triggered a data consistency
issue in macOS while testing this. Before running any tests I had
GarageBand open. I closed it without saving the open project. After
the first hard reboot later, it tried to reopen it and threw up an
error.

So I guess the unsaved project file got (partially?) deleted, but not
the state that tells it to reopen the currently open file on startup.

Data consistency matters.

Update: tested on the Mac Mini by pulling the plug (which is a
"normal" case; the USB-PD thing was a bit special). Still lost seconds
of fsync()ed data. There is no (working) last-gasp flush mechanism.

Further testing: doing this on FAT32 (to avoid APFS write
amplification, which is a separate problem...) doesn't help with the
IOPS on flush, but then I can look at powermetrics bandwidth stats to
see what NVMe is doing.

Artificially throttled to ~30 IOPS without full cache flushes, I get
93.52 ops/s and 397.77 KBytes/s of disk IO, which is reasonable, and:

ANS2 RD                         : 0.328 MB/s
ANS2 WR                         : 0.399 MB/s

That's NVMe controller DRAM bandwidth. Reasonable.

But doing flushes, this jumps to:

ANS2 RD                         : 6.248 MB/s
ANS2 WR                         : 9.909 MB/s

So that NVMe controller is doing *something* memory-intensive on flush
commands. Maybe it does a linear scan over a large cache hashtable?

I can get these stats because the NVMe controller is integrated into
the SoC and "drive cache" is just normal memory (it's unified just
like GPU memory), and Apple have very good instrumentation of things
like DRAM bandwidth utilization from different SoC agents.

Uh, guys? I'm not saying "the machines are only fast because they
cheat on data durability". I'm saying there's a stupid, unfortunate
performance bug when you *do* want durability that most people won't
notice because macOS gives you poor durability by default.

These SSDs are still fast for many use cases, and on laptops you don't
care about this issue because there's a battery backup anyway.

And besides, Apple can almost certainly fix this in a firmware upgrade.

Chances are this happened because this design came from iDevices,
where there's always a battery, so software never hits
drive-cache-related consistency/durability issues; probably ~no iOS
software ever uses a full sync, so nobody noticed it's slow.

And I'm just the Nth guy to facepalm when I found out that macOS makes
data durability double-opt-in with a nonstandard request. That's dumb,
but it has nothing to do with the machines. It's just the reason why
fio was magically fast on macOS and not on Linux.

Note that the NVMe controller behaves in a perfectly spec-compliant
way as far as data durability. It's just (ridiculously) slow when you
ask for it. Probably a bug.

But since insistent people on HN made me do another test, and for giggles:

FAT32 on internal NVMe with F_FULLFSYNC: 58 IOPS
FAT32 on a cheapo USB3 flash drive with the same: 223 IOPS.

If you run a durable database on an M1 today, try a cheap flash drive.
It'll be faster :-)

Of course, the flash drive doesn't have a cache at all, so it remains
at 223 IOPS with fsync() vs. 60000 IOPS or so on NVMe - so make sure
you go with NVMe if you're on a laptop, where it's safe; only use the
cheapo drive on Mac Minis and iMacs!
-------------------8<----------- cut here ---------- 8< -------------

So... I guess maybe, sometimes, substance matters?

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
