Hi Tomas,

> Hi,
> 
> At one of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%) with
> smaller (4k) data pages.

Nice. I just saw
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference - do you
have any plans for publishing the other graphs too (e.g. WAL block size
impact)?

> This opened a long discussion about possible explanations - I claimed one of 
> the
> main factors is the adoption of flash storage, due to pretty fundamental
> differences between HDD and SSD systems. But the discussion concluded with an
> agreement to continue investigating this, so here's an attempt to support the
> claim with some measurements/data.
> 
> Let me present results of low-level fio benchmarks on a couple different HDD
> and SSD drives. This should eliminate any postgres-related influence (e.g. 
> FPW),
> and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely show
> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
> worse.
> 
[..]
Can you share which Linux kernel version, filesystem, mount options, and LVM
setup (if any) you were using?

I've hastily tried your script on a 4 vCPU / 32 GB RAM / 1x ~900 GB NVMe device
(AWS i3.xlarge), kernel 5.x, ext4 with default mount options, no LVM, libaio
only. Deviations from your fio setup: runtime -> 1 min, 64 GB file, 1 iteration
only. Results are attached, w/o graphs.
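
Roughly, each data point in the attached results came from one fio run along
these lines (just a sketch to show the shape of the job; the exact options are
whatever your script sets, I only shortened runtime/size/loops, and the test
file path here is made up):

fio --name=randwrite-4k --filename=/mnt/nvme/fio.testfile \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=128 \
    --direct=1 --size=64g --runtime=60 --time_based \
    --numjobs=1 --group_reporting
# bs swept over 1k/2k/4k/8k/16k/32k, rw over randread/randwrite,
# one subdirectory per block size as in the grep output below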

> Now, compare this to the SSD. There are some differences between the models, 
> manufacturers, interface etc. but the impact of page size on IOPS is pretty 
> clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung 
> it's even more, etc. This means that workloads dominated by random I/O get 
> significant benefit from smaller pages.

Yup, same here, reproduced; 4 kB gives 1.42x the random write IOPS of 8 kB:
[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,
BTW it's interesting to compare this to your Optane 900P result (the same two
high bars for IOPS @ 4/8 kB), but in my case it's even more important to select
4 kB, so it behaves more like the Samsung 860 in your results.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the larger block sizes above are just a demonstration of device
bandwidth saturation: 32k * 30k IOPS =~ 1 GB/s of random reads (see the
arithmetic spelled out below). Given that the DB would be tuned @ 4 kB for the
OLTP app, the occasional Parallel Seq Scans for "critical reports" could then
only achieve ~70% of the bandwidth they would get on 8 kB, correct? (I'm
assuming most real systems are really OLTP, but with some reporting/data
exporting needs.) One way or another it would be very nice to be able to select
that tradeoff using initdb(1) without the need to recompile, which then begs
for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page
size, ...).
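
Spelling out that saturation arithmetic from the randread numbers above (values
straight from the fio output, rounded):

 4 kB: 172882 IOPS *  4 kB =~ 691 MB/s  <- IOPS-bound, ~70% of the ceiling
 8 kB: 122114 IOPS *  8 kB =~ 977 MB/s  <- bandwidth-bound
16 kB:  61907 IOPS * 16 kB =~ 990 MB/s  <- bandwidth-bound
32 kB:  30447 IOPS * 32 kB =~ 974 MB/s  <- bandwidth-bound

So from 8 kB upwards the drive sits at its ~1 GB/s ceiling and a bigger block
size only divides the IOPS, while at 4 kB the drive is still IOPS-limited.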

Do you envision any plans for this, or do we still need to gather more info on
exactly why this happens (perf reports)?

Also, did you guys discuss at that meeting any long-term plans for the storage
layer, by any chance? If we stick to 4 kB for the DB page size and the hardware
sector size, wouldn't it also be possible to win by disabling FPWs in the
longer run using io_uring (assuming O_DIRECT | O_ATOMIC one day)?
I recall that Thomas M. was researching O_ATOMIC; I think he wrote some of that
up pretty nicely in [1].

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

Attachment: libaio-2022-06-06.tgz
