Hi,

> (if an instance can recover and start up within 15 seconds) This
> depends on the volume of write operations and whether MySQL uses
> incremental checkpoints. The experiences with the two different
> databases vary. PostgreSQL incorporates numerous optimizations to
> mitigate FPW, such as enabling wal_compression as mentioned earlier,
> and reducing the WAL log size through vacuum.
> Another thing: did RDS set innodb_redo_log_capacity to the minimum
> value?

In our production RDS environment, we generally do not enable
wal_compression. Compressing full-page images at write time adds CPU
overhead and can introduce latency spikes (jitter), and for our
production services, stable and predictable performance matters more
than the I/O savings. That said, as my previous benchmark results
demonstrated, I fully agree that wal_compression provides a
significant performance boost when FPW is enabled.
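For context, the trade-off comes down to a single setting; a
postgresql.conf sketch (illustrative values, not a recommendation):

```ini
# postgresql.conf -- illustrative sketch, not our production config
full_page_writes = on         # torn-page protection via FPW (the default)
wal_compression = off         # our production choice: avoid CPU jitter on the write path
#wal_compression = lz4        # compresses full-page images; smaller WAL at some CPU cost
                              # (lz4/zstd values are available on recent major versions)
```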

Regarding MySQL's behavior: during normal operation, InnoDB relies on
incremental checkpoints (also known as fuzzy checkpoints), and it only
flushes all dirty pages to disk during a clean shutdown (a sharp
checkpoint).
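The knob wenhui asked about governs exactly this trade-off; a my.cnf
sketch (illustrative value, not our production setting;
innodb_redo_log_capacity exists since MySQL 8.0.30):

```ini
# my.cnf -- illustrative sketch
# A larger redo log lets InnoDB's fuzzy checkpointing lag further behind,
# improving write throughput but lengthening crash recovery; a smaller
# capacity bounds recovery time at the cost of more aggressive flushing.
innodb_redo_log_capacity = 8G
```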

As for whether we set innodb_redo_log_capacity to the minimum value
in production: no, we do not. As I mentioned in my previous email,
the aggressive configuration in my test was purely to simulate an
environment with higher-frequency checkpoints, in order to strictly
bound the crash recovery time. In a real-world PostgreSQL deployment,
we could theoretically achieve a similar effect by lowering
checkpoint_timeout, if the system allowed configuring it below the
current 30-second minimum.
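As a sketch of what I mean (illustrative values only): since
checkpoint_timeout cannot go below 30s, the practical way to force
more frequent checkpoints is to shrink the WAL budget instead, because
checkpoints are also triggered by WAL volume:

```ini
# postgresql.conf -- illustrative sketch of an aggressive-checkpoint setup
checkpoint_timeout = 30s            # current hard minimum; cannot be set lower
max_wal_size = 512MB                # a small WAL budget triggers checkpoints by volume
checkpoint_completion_target = 0.9  # spread checkpoint I/O across the interval
```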

Regards,
Baotiao

On Fri, Feb 27, 2026 at 10:36 AM wenhui qiu <[email protected]> wrote:
>
> HI
> > Regarding this point, we actually have a specific strategy in our
> > production environment: if an instance can recover and start up within
> > 15 seconds, we do not trigger an HA switchover. Furthermore, there are
> > many single-node instances deployed that do not have HA capabilities
> > at all. Therefore, optimizing the crash recovery time (which DWB helps
> > with by reducing WAL replay volume) remains highly valuable in
> > real-world deployments.
> (if an instance can recover and start up within 15 seconds) This depends on
> the volume of write operations and whether MySQL uses incremental
> checkpoints. The experiences with the two different databases vary. PostgreSQL
> incorporates numerous optimizations to mitigate FPW, such as enabling
> wal_compression as mentioned earlier, and reducing the WAL log size through
> vacuum. Another thing: did RDS set innodb_redo_log_capacity to the minimum
> value?
>
>
> Thanks
>
> On Fri, Feb 27, 2026 at 3:32 AM 陈宗志 <[email protected]> wrote:
>>
>> Hi wenhui
>>
>> I have carefully read through the email thread you linked. Here is a
>> summary of my takeaways from that discussion:
>>
>> 1. It provided a DWB demo patch based on a very old PostgreSQL version
>>    and tested it in certain scenarios. The performance comparison
>>    between DWB and FPW was inconclusive across different
>>    shared_buffers sizes.
>> 2. The patch caused severe performance degradation in COPY scenarios.
>> 3. It discussed shared_buffers sizing strategies (e.g., setting it to
>>    25% of RAM). Since the original author didn't follow this best
>>    practice, there is a lot of room for interpreting their performance
>>    results. In our benchmarks, we fix it at 25%, so this isn't a
>>    concern for us.
>> 4. It discussed whether BufferAccessStrategy should be used with DWB.
>>    (The intent of the strategy is to restrict a single process to a
>>    portion of shared_buffers to avoid impacting others, but with DWB,
>>    this might cause excessively frequent flushing and degrade
>>    performance).
>> 5. Besides normal read/write workloads, it noted the need to consider
>>    bulk loads, SELECT on large unhinted tables, vacuum speed,
>>    checkpoint duration, and others to prevent severe regressions in
>>    these areas.
>> 6. Double-write cannot replace FPW during backups.
>>
>> From reading that thread, it seems the general concept of a DWB is
>> actually quite acceptable to the community; it's just that no one has
>> invested enough effort to fully solve all these edge cases yet.
>>
>> > waiting for the primary to recover instead of performing an HA
>> > switchover is generally considered unacceptable.
>>
>> Regarding this point, we actually have a specific strategy in our
>> production environment: if an instance can recover and start up within
>> 15 seconds, we do not trigger an HA switchover. Furthermore, there are
>> many single-node instances deployed that do not have HA capabilities
>> at all. Therefore, optimizing the crash recovery time (which DWB helps
>> with by reducing WAL replay volume) remains highly valuable in
>> real-world deployments.
>>
>> Regards,
>> Baotiao
>>
>> On Thu, Feb 19, 2026 at 4:19 PM wenhui qiu <[email protected]> wrote:
>> >
>> > HI Robert
>> >      In fact, this was discussed over a decade ago
>> > (https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com).
>> > In practice, this mainly stems from the significant overhead introduced
>> > by FPW. The community has adopted various approaches to mitigate its
>> > impact, including compressing FPW and extending checkpoint intervals.
>> > Nowadays, waiting for the primary to recover instead of performing an HA
>> > switchover is generally considered unacceptable.
>> >
>> >
>> > Thanks
>> >
>> > On Thu, Feb 19, 2026 at 2:00 AM Robert Treat <[email protected]> wrote:
>> >>
>> >> On Mon, Feb 16, 2026 at 9:07 AM Jakub Wartak
>> >> <[email protected]> wrote:
>> >> > On Mon, Feb 9, 2026 at 7:53 PM 陈宗志 <[email protected]> wrote:
>> >> > >
>> >> > > Hi hackers,
>> >> > >
>> >> > > I raised this topic a while back [1] but didn't get much traction, so
>> >> > > I went ahead and implemented it: a doublewrite buffer (DWB) mechanism
>> >> > > for PostgreSQL as an alternative to full_page_writes.
>> >> > >
>> >> > > The core argument is straightforward. FPW and checkpoint frequency are
>> >> > > fundamentally at odds:
>> >> > >
>> >> > > - FPW wants fewer checkpoints -- each checkpoint triggers a wave of
>> >> > > full-page WAL writes for every page dirtied for the first time,
>> >> > > bloating WAL and tanking write throughput.
>> >> > > - Fast crash recovery wants more checkpoints -- less WAL to replay
>> >> > > means the database comes back sooner.
>> >> > >
>> >> > > DWB resolves this tension by moving torn page protection out of the
>> >> > > WAL path entirely. Instead of writing full pages into WAL (foreground,
>> >> > > latency-sensitive), dirty pages are sequentially written to a
>> >> > > dedicated doublewrite buffer area on disk before being flushed to
>> >> > > their actual locations. The buffer is fsync'd once when full, then
>> >> > > pages are scatter-written to their final positions. On crash recovery,
>> >> > > intact copies from the DWB repair any torn pages.
>> >> > >
>> >> > > Key design differences:
>> >> > >
>> >> > > - FPW: 1 WAL write (foreground) + 1 page write = directly impacts SQL 
>> >> > > latency
>> >> > > - DWB: 2 page writes (background flush path) = minimal user-visible 
>> >> > > impact
>> >> > > - DWB batches fsync() across multiple pages; WAL fsync batching is
>> >> > > limited by foreground latency constraints
>> >> > > - DWB decouples torn page protection from checkpoint frequency, so you
>> >> > > can checkpoint as often as you want without write amplification
>> >> > >
>> >> > > I ran sysbench benchmarks (io-bound, --tables=10
>> >> > > --table_size=10000000) with checkpoint_timeout=30s,
>> >> > > shared_buffers=4GB, synchronous_commit=on. Each scenario uses a fresh
>> >> > > database, VACUUM FULL, 60s warmup, 300s run.
>> >> > >
>> >> > > Results (TPS):
>> >> > >
>> >> > >                      FPW OFF    FPW ON     DWB ON
>> >> > > read_write/32        18,038      7,943     13,009
>> >> > > read_write/64        24,249      9,533     15,387
>> >> > > read_write/128       27,801      9,715     15,387
>> >> > > write_only/32        53,146     18,116     31,460
>> >> > > write_only/64        57,628     19,589     32,875
>> >> > > write_only/128       59,454     14,857     33,814
>> >> > >
>> >> > > Avg latency (ms):
>> >> > >
>> >> > >                      FPW OFF    FPW ON     DWB ON
>> >> > > read_write/32          1.77       4.03       2.46
>> >> > > read_write/64          2.64       6.71       4.16
>> >> > > read_write/128         4.60      13.17       9.81
>> >> > > write_only/32          0.60       1.77       1.02
>> >> > > write_only/64          1.11       3.27       1.95
>> >> > > write_only/128         2.15       8.61       3.78
>> >> > >
>> >> > > FPW ON drops to ~25% of baseline (FPW OFF). DWB ON holds at ~57%. In
>> >> > > write-heavy scenarios DWB delivers over 2x the throughput of FPW with
>> >> > > significantly better latency.
>> >> > >
>> >> > > The implementation is here: https://github.com/baotiao/postgres
>> >> > >
>> >> > > I'd appreciate any feedback on the approach. Would be great if the
>> >> > > community could take a look and see if this direction is worth
>> >> > > pursuing upstream.
>> >> >
>> >> > Hi Baotiao
>> >> >
>> >> > I'm a newbie here, but took your idea with some interest; probably
>> >> > everyone else is busy with work on other patches before the commit
>> >> > freeze.
>> >> >
>> >>
>> >> I'm somewhat less of a noob here, so I'll confirm that this proposal
>> >> has basically zero chance of getting in, at least for the v19 cycle.
>> >> This isn't so much about the proposal itself, but more in that if you
>> >> were trying to pick the worst time of year to submit a large,
>> >> complicated feature into the postgresql workflow, this would be really
>> >> close to that.
>> >>
>> >> However, I have also wondered about this specific trade-off (FPW vs
>> >> DWB) for years, but until now, the level of effort required to produce
>> >> a meaningful POC that would confirm if the idea was worth pursuing was
>> >> so large that I think it stopped anyone from even trying. So,
>> >> hopefully everyone will realize that we don't live in that world
>> >> anymore, and as a side benefit, apparently the idea is worth pursuing.
>> >>
>> >> > I think it would be valuable to have this as I've been hit by 
>> >> > PostgreSQL's
>> >> > unsteady (chain-saw-like) WAL traffic, especially related to touching
>> >> > the pages for the first time after a checkpoint, up to the point of
>> >> > saturating network links.
>> >> > The common
>> >> > counter-argument to double buffering is probably that FPI may(?) 
>> >> > increase WAL
>> >> > standby replication rate and this would have to be taken into account
>> >> > (but we also should take into account how much 
>> >> > maintenance_io_concurrency/
>> >> > posix_fadvise() prefetching that we do today helps avoid any I/O stalls 
>> >> > on
>> >> > fetching pages - so it should be basically free), I see even that you
>> >> > got benefits
>> >> > by not using FPI. Interesting.
>> >> >
>> >> > Some notes/questions about the patches itself:
>> >> >
>> >>
>> >> So, I haven't looked at the code itself; to be honest, I am a bit too
>> >> paranoid to dive into generated code that would seem to carry some
>> >> likely level of legal risk around potential reuse of GPL/proprietary
>> >> code it might be based on (either in its original training, inference,
>> >> or context used for generation. Yeah, I know innodb isn't written in
>> >> C, but still). That said, I did have some feedback and questions on
>> >> the proposal itself, and some suggestions for how to move things
>> >> forward.
>> >>
>> >> It would be helpful if you could provide a little more information on
>> >> the system you are running these benchmarks on, specifically for me
>> >> the underlying OS/Filesystem/hardware, and I'd even be interested in
>> >> the build flags. I'd also be interested to know if you did any kind of
>> >> crash safety testing... while it is great to have improved
>> >> performance, presumably that isn't actually the primary point of these
>> >> subsystems. It'd also be worth knowing if you tested this on any
>> >> systems with replication (physical or logical) since we'd need to
>> >> understand those potential downstream effects. I'm tempted to say you
>> >> should have an AI generate some pgbench scripts. Granted it's early,
>> >> and it's fine if you haven't done any of this yet, but I imagine we'll
>> >> need to look at it eventually.
>> >>
>> >> > 0. The convention here is send the patches using:
>> >> >    git format-patch -v<VERSION> HEAD~<numberOfpatches>
>> >> >    for easier review. The 0003 probably should be out of scope. Anyway 
>> >> > I've
>> >> >    attached all of those so maybe somebody else is going to take a
>> >> > look at them too,
>> >> >    they look very mature. Is this code used in production already 
>> >> > anywhere? (and
>> >> >    BTW the numbers are quite impressive)
>> >> >
>> >>
>> >> While Jakub is right that the convention is to send patches, that
>> >> convention is based on a manual development model, not an agentic
>> >> development model. While there is no official project policy on this,
>> >> IMHO the thing we really need from you is not the code output, but the
>> >> prompts that were used to generate the code. There are plenty of folks
>> >> who have access to claude that could then use those prompts to
>> >> "recreate with enough proximity" the work you had claude do, and that
>> >> process would also allow for additional verification and reduction of
>> >> any legal concerns or concerns about investing further human
>> >> time/energy. (No offense, but as you are not a regular contributor,
>> >> you could analogize this to when third parties do large code dumps and
>> >> say "here's a contribution, it's up to you to figure out how to use
>> >> it". Ideally we want other folks to be able to pick up the project and
>> >> continue with it, even if it means recreating it, and that works best
>> >> if we have the underlying prompts).
>> >>
>> >> The claude code configuration file is a good start, but certainly not
>> >> enough. Probably the ideal here would be full session logs, although a
>> >> developer-diary would probably also suffice. I'm kind of guessing here
>> >> because I don't know the scope of the prompts involved or how you were
>> >> interacting with Claude in order to get where you are now, but those
>> >> seem like the more obvious tools for work of this size whose intention
>> >> is to be open.
>> >>
>> >>
>> >> Robert Treat
>> >> https://xzilla.net
>> >>
>> >>
>>
>>
>> --
>> ---
>> Blog: https://baotiao.github.io/
>> Twitter: https://twitter.com/baotiao
>> Git: https://github.com/baotiao



-- 
---
Blog: https://baotiao.github.io/
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao

