Hi Robert,

In fact, this was discussed over a decade ago:
https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com

In practice, this mainly stems from the significant overhead introduced
by FPW. The community has adopted various approaches to mitigate its
impact, including compressing FPWs and extending checkpoint intervals.
Nowadays, waiting for the primary to recover instead of performing an HA
switchover is generally considered unacceptable.


Thanks

On Thu, Feb 19, 2026 at 2:00 AM Robert Treat <[email protected]> wrote:

> On Mon, Feb 16, 2026 at 9:07 AM Jakub Wartak
> <[email protected]> wrote:
> > On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <[email protected]> wrote:
> > >
> > > Hi hackers,
> > >
> > > I raised this topic a while back [1] but didn't get much traction, so
> > > I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
> > > for PostgreSQL as an alternative to full_page_writes.
> > >
> > > The core argument is straightforward. FPW and checkpoint frequency are
> > > fundamentally at odds:
> > >
> > > - FPW wants fewer checkpoints -- each checkpoint triggers a wave of
> > > full-page WAL writes for every page dirtied for the first time,
> > > bloating WAL and tanking write throughput.
> > > - Fast crash recovery wants more checkpoints -- less WAL to replay
> > > means the database comes back sooner.
> > >
> > > DWB resolves this tension by moving torn page protection out of the
> > > WAL path entirely. Instead of writing full pages into WAL (foreground,
> > > latency-sensitive), dirty pages are sequentially written to a
> > > dedicated doublewrite buffer area on disk before being flushed to
> > > their actual locations. The buffer is fsync'd once when full, then
> > > pages are scatter-written to their final positions. On crash recovery,
> > > intact copies from the DWB repair any torn pages.
> > >
> > > Key design differences:
> > >
> > > - FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL latency
> > > - DWB: 2 page writes (background flush path) = minimal user-visible impact
> > > - DWB batches fsync() across multiple pages; WAL fsync batching is
> > > limited by foreground latency constraints
> > > - DWB decouples torn page protection from checkpoint frequency, so you
> > > can checkpoint as often as you want without write amplification
> > >
> > > I ran sysbench benchmarks (io-bound, --tables=10
> > > --table_size=10000000) with checkpoint_timeout=30s,
> > > shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
> > > database, VACUUM FULL, 60s warmup, 300s run.
> > >
> > > Results (TPS):
> > >
> > >                      FPW OFF    FPW ON     DWB ON
> > > read_write/32        18,038      7,943     13,009
> > > read_write/64        24,249      9,533     15,387
> > > read_write/128       27,801      9,715     15,387
> > > write_only/32        53,146     18,116     31,460
> > > write_only/64        57,628     19,589     32,875
> > > write_only/128       59,454     14,857     33,814
> > >
> > > Avg latency (ms):
> > >
> > >                      FPW OFF    FPW ON     DWB ON
> > > read_write/32          1.77       4.03       2.46
> > > read_write/64          2.64       6.71       4.16
> > > read_write/128         4.60      13.17       9.81
> > > write_only/32          0.60       1.77       1.02
> > > write_only/64          1.11       3.27       1.95
> > > write_only/128         2.15       8.61       3.78
> > >
> > > FPW ON drops to ~25% of baseline (FPW OFF). DWB ON holds at ~57%. In
> > > write-heavy scenarios DWB delivers over 2x the throughput of FPW with
> > > significantly better latency.
> > >
> > > The implementation is here: https://github.com/baotiao/postgres
> > >
> > > I'd appreciate any feedback on the approach. Would be great if the
> > > community could take a look and see if this direction is worth
> > > pursuing upstream.
> >
> > Hi Baotiao
> >
> > I'm a newbie here, but took your idea with some interest; probably
> > everyone else is busy with work on other patches before commit freeze.
> >
>
> I'm somewhat less of a noob here, so I'll confirm that this proposal
> has basically zero chance of getting in, at least for the v19 cycle.
> This isn't so much about the proposal itself, but more in that if you
> were trying to pick the worst time of year to submit a large,
> complicated feature into the postgresql workflow, this would be really
> close to that.
>
> However, I have also wondered about this specific trade-off (FPW vs
> DWB) for years, but until now, the level of effort required to produce
> a meaningful POC that would confirm if the idea was worth pursuing was
> so large that I think it stopped anyone from even trying. So,
> hopefully everyone will realize that we don't live in that world
> anymore, and as a side benefit, apparently the idea is worth pursuing.
>
> > I think it would be valuable to have this, as I've been hit by
> > PostgreSQL's unsteady (chain-saw-like) WAL traffic, especially related
> > to the 1st touches of pages after a checkpoint, up to the point of
> > saturating network links. The common counter-argument to double
> > buffering is probably that FPI may(?) increase the WAL standby
> > replication rate, and this would have to be measured (but we should
> > also take into account how much maintenance_io_concurrency/
> > posix_fadvise() prefetching that we do today helps avoid any I/O
> > stalls on fetching pages - so it should be basically free). I see that
> > you even got benefits by not using FPI. Interesting.
> >
> > Some notes/questions about the patches themselves:
> >
>
> So, I haven't looked at the code itself; tbh I am a bit too
> paranoid to dive into generated code that would seem to carry some
> likely level of legal risk around potential reuse of GPL/proprietary
> code it might be based on (either in its original training, inference,
> or context used for generation. Yeah, I know innodb isn't written in
> C, but still). That said, I did have some feedback and questions on
> the proposal itself, and some suggestions for how to move things
> forward.
>
> It would be helpful if you could provide a little more information on
> the system you are running these benchmarks on, specifically for me
> the underlying OS/Filesystem/hardware, and I'd even be interested in
> the build flags. I'd also be interested to know if you did any kind of
> crash safety testing... while it is great to have improved
> performance, presumably that isn't actually the primary point of these
> subsystems. It'd also be worth knowing if you tested this on any
> systems with replication (physical or logical) since we'd need to
> understand those potential downstream effects. I'm tempted to say you
> should have an AI generate some pgbench scripts. Granted, it's early and
> fine if you haven't done any of this, but I imagine we'll need to look at
> it eventually.
>
> > 0. The convention here is to send the patches using:
> >    git format-patch -v<VERSION> HEAD~<numberOfpatches>
> >    for easier review. The 0003 probably should be out of scope. Anyway
> >    I've attached all of those, so maybe somebody else is going to take
> >    a look at them too; they look very mature. Is this code used in
> >    production already anywhere? (And BTW the numbers are quite
> >    impressive.)
> >
>
> While Jakub is right that the convention is to send patches, that
> convention is based on a manual development model, not an agentic
> development model. While there is no official project policy on this,
> IMHO the thing we really need from you is not the code output, but the
> prompts that were used to generate the code. There are plenty of folks
> who have access to claude that could then use those prompts to
> "recreate with enough proximity" the work you had claude do, and that
> process would also allow for additional verification and reduction of
> any legal concerns or concerns about investing further human
> time/energy. (No offense, but as you are not a regular contributor,
> you could analogize this to when third parties do large code dumps and
> say "here's a contribution, it's up to you to figure out how to use
> it". Ideally we want other folks to be able to pick up the project and
> continue with it, even if it means recreating it, and that works best
> if we have the underlying prompts).
>
> The claude code configuration file is a good start, but certainly not
> enough. Probably the ideal here would be full session logs, although a
> developer-diary would probably also suffice. I'm kind of guessing here
> because I don't know the scope of the prompts involved or how you were
> interacting with Claude in order to get where you are now, but those
> seem like the more obvious tools for work of this size whose intention
> is to be open.
>
>
> Robert Treat
> https://xzilla.net
>