Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value. I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> I don't quite understand why we shouldn't do this (or at least try to).
>
> IMO the benefits of using smaller blocks were substantial (especially
> for 4kB, most likely due matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.
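For concreteness, the wrapper binary from the quote above could be little
more than a dispatch script keyed on pg_control's block size. Everything
here (the per-BLCKSZ binary naming scheme, the install path) is a
hypothetical sketch, not an existing mechanism:

```shell
#!/bin/sh
# Hypothetical dispatcher: pick the "real" postgres binary based on the
# block size recorded in pg_control.  Binary names are made up.

# Extract the value from a "Database block size: NNNN" line, as printed
# by pg_controldata.
parse_blcksz() {
    awk '/Database block size/ {print $4}'
}

# Demonstrate the parsing on a canned pg_controldata line (no cluster
# is assumed to be available here):
sample='Database block size:                  8192'
blcksz=$(printf '%s\n' "$sample" | parse_blcksz)
echo "would exec: postgres.blcksz${blcksz}"

# A real wrapper would instead do something like:
#   blcksz=$(pg_controldata "$PGDATA" | parse_blcksz)
#   exec "/usr/lib/postgresql/bin/postgres.blcksz${blcksz}" "$@"
```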
Mostly because I think there are bigger gains to be had elsewhere. IME not a
whole lot of storage ships by default with externally visible 4k sectors; it
needs to be manually reformatted [1], which loses all data, so it has to be
done at initial setup. Then postgres would also need OS specific trickery to
figure out that the IO stack is indeed entirely 4k (checking the sector size
alone is not enough). And you run into the issue that suddenly the #column
and index-tuple-size limits are lower, which won't make it easier.

I think we should change the default of the WAL blocksize to 4k though.
There's practically no downside, and it drastically reduces postgres-side
write amplification in many transactional workloads, by only writing out
partially filled 4k pages instead of partially filled 8k pages.

> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable
effort (leaving aside compiling the code multiple times and switching
between the variants).

If we were to start building postgres for multiple compile-time settings, I
think there are uses other than switching between BLCKSZ values that are
potentially more interesting. E.g. you can see substantially improved
performance by being able to use SSE4.2 without CPU dispatch (partially
because it allows compiler autovectorization, partially because it allows
the compiler to use newer non-vectorized math instructions (when targeting
AVX IIRC), partially because the dispatch overhead itself is not
insubstantial). Another example: ARMv8 performance is substantially better
if you target ARMv8.1-A instead of ARMv8.0, due to having atomic
instructions instead of LL/SC (it still baffles me that they didn't do this
earlier, LL/SC is just inherently inefficient).

Greetings,

Andres Freund

[1] To see the supported LBA formats:
for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done
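PS: the WAL block size is already selectable at build time, so
experimenting with a 4k default as suggested above doesn't need any new
infrastructure; IIRC the value is in kilobytes for both build systems:

```shell
# autoconf build: XLOG_BLCKSZ of 4kB instead of the default 8kB
./configure --with-wal-blocksize=4

# meson build, equivalently:
meson setup build -Dwal_blocksize=4
```

Note this only changes the compile-time constant for that build; it is not
a runtime setting.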