Hi,

From what I can see, support for write lifetime hints (used for NVMe multi-streaming) was merged into the Linux kernel some time ago [1]. In theory it allows data to be written together on the media so that it can be erased together, which minimizes garbage collection and results in reduced write amplification and more efficient flash utilization [2]. I couldn't find any discussion of this on hackers, so I decided to experiment with the feature a bit. My idea was to test a quite naive approach in which all file descriptors related to temporary files are assigned `RWH_WRITE_LIFE_SHORT` and all the rest `RWH_WRITE_LIFE_EXTREME`. The attached patch is a dead simple POC without any infrastructure to enable or disable the hints.
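For illustration, here is a minimal sketch of how such a hint can be set on a single descriptor through `fcntl` with `F_SET_RW_HINT` (available since Linux 4.13). This is not the code from the attached patch: the helper names `lifetime_hint_for` and `set_write_lifetime_hint` are hypothetical, and the fallback constants are only needed when the libc headers are too old to provide them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes the RW hint macros under it */
#endif
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values as in linux/fcntl.h. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

/*
 * Hypothetical helper: map "is this a temporary file?" to a hint,
 * following the naive policy described above.
 */
uint64_t
lifetime_hint_for(bool is_temp_file)
{
    return is_temp_file ? RWH_WRITE_LIFE_SHORT : RWH_WRITE_LIFE_EXTREME;
}

/*
 * Hypothetical helper: apply the hint to an open descriptor.
 * Returns 0 on success, -1 on failure (errno is set by fcntl, e.g.
 * EINVAL on kernels that do not know F_SET_RW_HINT).
 */
int
set_write_lifetime_hint(int fd, bool is_temp_file)
{
    uint64_t hint = lifetime_hint_for(is_temp_file);

    /* The kernel expects a pointer to a 64-bit hint value. */
    if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
        return -1;
    return 0;
}
```

In a sketch like this the helper would be called right after a file is opened, and a failure would probably just be ignored, since the hint is only advisory.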

It turns out that it's possible to run benchmarks on some EC2 instance types (e.g. c5) with a suitable kernel version, since they expose a volume as an NVMe device:

```
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     vol01cdbc7ec86f17346 Amazon Elastic Block Store               1         0.00 B / 8.59 GB           512 B + 0 B      1.0
```

To get some baseline results I ran several rounds of pgbench on these quite modest instances (dedicated, EBS-optimized) with a slightly adjusted `max_wal_size` and an otherwise default configuration:

    $ pgbench -s 200 -i
    $ pgbench -T 600 -c 2 -j 2

Analyzing the `strace` output, I can see that during this test there was a significant number of operations on pg_stat_tmp and xlogtemp, so I assume write lifetime hints should have some effect.

As a result I got a latency reduction of about 5-8% (but so far these numbers are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?
* Maybe someone else has similar results?
* Any suggestions about what the best/worst case scenarios for using this kind of hint could be?

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf

Attachment: nvme_write_lifetime_poc.patch