Good day all
Just following up to see if there is any advice from the community. My
original post was very long, but I just wanted to paint the picture.
In summary I just want to find out if anybody has some concrete advice
on storage devices that are acceptable for use with Postgres,
especially regarding latency. I've highlighted some tests I've run, but
I need help interpreting the numbers correctly.
Any feedback on the following will help:
* Acceptable write IO latency
  o WAL
  o Data
  o Temp
  o ...
* Acceptable read IO latency
* Any other storage/drive related advice
Regards
Riaan
On 2026/02/11 01:13, Riaan Stander wrote:
Good day
We host our own Postgres (v17) server on-prem as the backbone of our
SaaS application. It's a fairly busy OLTP application with a
database-per-tenant strategy, which obviously complicates our setup.
Our hosting platform is as follows:
3 x Host Servers running Microsoft Storage Spaces in a 3 way mirror
Ubuntu VM hosting Postgres
A few months ago we had some severe performance issues, with lots of
queries and write operations just pending. After some deep
investigation we realised that disk IO was causing the issue. We used
iostat and could see the write await was above 30ms, sometimes spiking
much higher. This was resolved by moving our backups (made with Veeam)
off the primary to a replica on other infrastructure. Our current happy
state, where clients are not experiencing issues, is an iostat write
await of 5ms and lower.
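For anyone wanting to reproduce the measurement: a watch loop like the one below makes the w_await spikes easy to spot. The device name, sample count, and 5ms threshold are just illustrative values, and the w_await column position varies between sysstat releases, so the script locates it from the header rather than hard-coding it.

```shell
# Print 10 one-second samples of extended device stats and flag samples whose
# write await exceeds the threshold. DEVICE is an assumption -- substitute
# your data/WAL device. The awk finds the w_await column from the header,
# since sysstat releases shuffle the columns around.
DEVICE=sda
THRESHOLD_MS=5
iostat -dxy "$DEVICE" 1 10 | awk -v t="$THRESHOLD_MS" '
  /w_await/ { for (i = 1; i <= NF; i++) if ($i == "w_await") col = i; next }
  $1 != "" && col && $col + 0 > t { printf "%s: w_await %.1f ms > %s ms\n", $1, $col, t }'
```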
All was good for a few months until recently, when the issue started
again. This time it could not be the backups. We had various hardware
vendors involved, but at some point it came to light that the Storage
Spaces pool consists entirely of mechanical disks, with NVMe used only
for Storage Spaces journaling and caching. There are now some
discussions about upgrading the drives to SSD, but my concern is that
this is not guaranteed to solve the issue. Especially with the 3-way
mirror, it seems every write must be acknowledged by the other hosts
before returning, so some latency is almost impossible to remove.
So now to my question. I started running some IO tests using fio,
pg_test_fsync & pg_test_timing. Before we spend days/months trying to
tune Postgres settings, I'm trying to get some definitive published
information about what IO numbers I should expect when running plain
hardware tests with Postgres completely out of the loop. I've seen
some information suggesting a write latency of 1ms or less is what you
want for WAL. My logic says that if you have a stiffie (floppy) drive
for storage you can tune around it, but you still have a stiffie drive.
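To put the 1ms WAL figure in perspective: with synchronous_commit=on, each commit waits for one WAL fsync, so fsync latency directly caps the commit rate a single session can achieve. A rough back-of-envelope (the latency values are just illustrative):

```shell
# Single-session commit ceiling at various WAL fsync latencies:
# one fsync per commit, iodepth=1, so max rate is roughly 1000 / latency_ms.
for lat_ms in 0.5 1 5 30; do
  awk -v l="$lat_ms" 'BEGIN { printf "fsync %.1f ms  ->  max ~%d commits/s per session\n", l, 1000 / l }'
done
```

At the 30ms awaits we saw during the incident, that's roughly 33 commits/s per session before any Postgres tuning even enters the picture.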
These are the tests I've run so far:
1. WAL-Style Latency Test (4K random sync writes)
fio --name=wal-latency --filename=$TESTDIR/fio_wal_test --size=2G \
    --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio --direct=1 \
    --fsync=1 --runtime=60 --group_reporting
2. Random Read IOPS Test (index lookup simulation)
fio --name=index-read --filename=$TESTDIR/fio_index_test --size=8G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --group_reporting
3. Mixed OLTP Test (70% read / 30% write)
fio --name=oltp-mixed --filename=$TESTDIR/fio_oltp_mixed --size=8G \
    --rw=randrw --rwmixread=70 --bs=8k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --group_reporting
4. Checkpoint Burst Test (sequential write pressure)
fio --name=checkpoint-burst --filename=$TESTDIR/fio_checkpoint \
    --size=20G --rw=write --bs=1M --iodepth=64 --ioengine=libaio \
    --direct=1 --runtime=60 --group_reporting
5. PostgreSQL fsync Code Path Test
pg_test_fsync -f $TESTDIR/pg_test_fsync
6. Timer / Scheduling Jitter Test
pg_test_timing -d 3
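One note on interpreting test 1: with --fsync=1, fio reports fsync latency separately from plain write latency, and the sync numbers are the ones that matter for WAL. A variant with JSON output makes them easy to pull out; the jq paths below match recent fio versions, but that's an assumption, so inspect the raw JSON on your build first.

```shell
# Re-run the WAL-style test with JSON output and extract the fsync latency
# (mean and p99, converted from nanoseconds to microseconds). The
# .jobs[0].sync.lat_ns paths are an assumption for recent fio versions.
fio --name=wal-latency --filename="$TESTDIR/fio_wal_test" --size=2G \
    --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio --direct=1 \
    --fsync=1 --runtime=60 --group_reporting --output-format=json \
  | jq '.jobs[0].sync.lat_ns
        | {mean_us: (.mean / 1000), p99_us: (.percentile."99.000000" / 1000)}'
```

Going by the 1ms WAL guidance above, I'd hope to see p99_us comfortably under 1000 on a healthy WAL device.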
Regards
Riaan