Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value. I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> I don't quite understand why we shouldn't do this (or at least try to).
>
> IMO the benefits of using smaller blocks were substantial (especially
> for 4kB, most likely due matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.
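For concreteness, the wrapper binary from the quote above could be little
more than a dispatch script keyed on pg_control's block size. Everything
here (the per-BLCKSZ binary naming scheme, the install path) is a
hypothetical sketch, not an existing mechanism:

```shell
#!/bin/sh
# Hypothetical dispatcher: pick the "real" postgres binary based on the
# block size recorded in pg_control.  Binary names are made up.

# Extract the value from a "Database block size: NNNN" line, as printed
# by pg_controldata.
parse_blcksz() {
    awk '/Database block size/ {print $4}'
}

# Demonstrate the parsing on a canned pg_controldata line (no cluster
# is assumed to be available here):
sample='Database block size:                  8192'
blcksz=$(printf '%s\n' "$sample" | parse_blcksz)
echo "would exec: postgres.blcksz${blcksz}"

# A real wrapper would instead do something like:
#   blcksz=$(pg_controldata "$PGDATA" | parse_blcksz)
#   exec "/usr/lib/postgresql/bin/postgres.blcksz${blcksz}" "$@"
```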
Mostly because I think there are bigger gains to be had elsewhere. IME not a
whole lot of storage ships by default with externally visible 4k sectors; it
needs to be manually reformatted [1], which loses all data, so it has to be
done at initial setup. Then postgres would also need OS specific trickery to
figure out that the IO stack is indeed entirely 4k (checking the sector size
alone is not enough). And you run into the issue that suddenly the #column
and index-tuple-size limits are lower, which won't make it easier.

I think we should change the default of the WAL blocksize to 4k though.
There's practically no downside, and it drastically reduces postgres-side
write amplification in many transactional workloads, by only writing out
partially filled 4k pages instead of partially filled 8k pages.

> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable
effort (leaving aside compiling the code multiple times and switching
between the variants).

If we were to start building postgres for multiple compile-time settings, I
think there are uses other than switching between BLCKSZ values that are
potentially more interesting. E.g. you can see substantially improved
performance by being able to use SSE4.2 without CPU dispatch (partially
because it allows compiler autovectorization, partially because it allows
the compiler to use newer non-vectorized math instructions (when targeting
AVX IIRC), partially because the dispatch overhead itself is not
insubstantial). Another example: ARMv8 performance is substantially better
if you target ARMv8.1-A instead of ARMv8.0, due to having atomic
instructions instead of LL/SC (it still baffles me that they didn't do this
earlier, LL/SC is just inherently inefficient).

Greetings,

Andres Freund

[1] To see the supported LBA formats:
for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done
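PS: the WAL block size is already selectable at build time, so
experimenting with a 4k default as suggested above doesn't need any new
infrastructure; IIRC the value is in kilobytes for both build systems:

```shell
# autoconf build: XLOG_BLCKSZ of 4kB instead of the default 8kB
./configure --with-wal-blocksize=4

# meson build, equivalently:
meson setup build -Dwal_blocksize=4
```

Note this only changes the compile-time constant for that build; it is not
a runtime setting.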