Hi
Why have you selectively ignored this parameter?
wal_compression = lz4 # enables compression of full-page writes;
# off, pglz, lz4, zstd, or on
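
Since wal_compression shrinks exactly the full-page images that dominate your full_pages WAL numbers, it would be interesting to see Test 2 rerun with, for example (the io_torn_pages_protection GUC name is taken from your patch description):

```
wal_compression = lz4
io_torn_pages_protection = full_pages
```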
Thanks
On Thu, Feb 26, 2026 at 3:14 PM 陈宗志 <[email protected]> wrote:
> Hi Robert,
>
> Thanks for the feedback and suggestions.
>
> > I'm somewhat less of a noob here, so I'll confirm that this proposal
> > has basically zero chance of getting in, at least for the v19 cycle.
> > This isn't so much about the proposal itself, but more in that if you
> > were trying to pick the worst time of year to submit a large,
> > complicated feature into the PostgreSQL workflow, this would be really
> > close to that.
> > However, I have also wondered about this specific trade-off (FPW vs
> > DWB) for years, but until now, the level of effort required to produce
> > a meaningful POC that would confirm if the idea was worth pursuing was
> > so large that I think it stopped anyone from even trying. So,
> > hopefully everyone will realize that we don't live in that world
> > anymore, and as a side benefit, apparently the idea is worth pursuing.
>
> I completely understand, and I actually have no intention of pushing
> this patch for the v19 cycle. My primary goal right now is simply to
> share the POC results and discuss the idea with the community to see
> if this direction is worth pursuing.
>
> For context, I have been a MySQL InnoDB developer for over 10 years,
> but I admit I am a newcomer to the PostgreSQL community, so I am still
> familiarizing myself with the standard workflow and processes here.
>
> >> I think it would be valuable to have this as I've been hit by PostgreSQL's
> >> unsteady (chain-saw-like) WAL traffic, especially related to touching the
> >> pages for the 1st time after a checkpoint, up to the point of saturating
> >> network links. The common counter-argument to double buffering is probably
> >> that FPI may(?) increase the WAL standby replication rate, and this would
> >> have to be taken into account (but we should also take into account how
> >> much the maintenance_io_concurrency/posix_fadvise() prefetching that we do
> >> today helps avoid any I/O stalls on fetching pages - so it should be
> >> basically free). I see that you even got benefits by not using FPI.
> >> Interesting.
> >>
> >> Some notes/questions about the patches themselves:
> >>
> > So, I haven't looked at the code itself; to be honest, I am a bit too
> > paranoid to dive into generated code that would seem to carry some
> > likely level of legal risk around potential reuse of GPL/proprietary
> > code it might be based on (either in its original training, inference,
> > or context used for generation. Yeah, I know innodb isn't written in
> > C, but still). That said, I did have some feedback and questions on
> > the proposal itself, and some suggestions for how to move things
> > forward.
> >
> >> 0. The convention here is to send the patches using:
> >> git format-patch -v<VERSION> HEAD~<numberOfpatches>
> >> for easier review. The 0003 probably should be out of scope. Anyway, I've
> >> attached all of those so maybe somebody else is going to take a look at
> >> them too; they look very mature. Is this code used in production anywhere
> >> already? (BTW the numbers are quite impressive)
> >>
> > While Jakub is right that the convention is to send patches, that
> > convention is based on a manual development model, not an agentic
> > development model. While there is no official project policy on this,
> > IMHO the thing we really need from you is not the code output, but the
> > prompts that were used to generate the code. There are plenty of folks
> > who have access to claude that could then use those prompts to
> > "recreate with enough proximity" the work you had claude do, and that
> > process would also allow for additional verification and reduction of
> > any legal concerns or concerns about investing further human
> > time/energy. (No offense, but as you are not a regular contributor,
> > you could analogize this to when third parties do large code dumps and
> > say "here's a contribution, it's up to you to figure out how to use
> > it". Ideally we want other folks to be able to pick up the project and
> > continue with it, even if it means recreating it, and that works best
> > if we have the underlying prompts).
> > The claude code configuration file is a good start, but certainly not
> > enough. Probably the ideal here would be full session logs, although a
> > developer-diary would probably also suffice. I'm kind of guessing here
> > because I don't know the scope of the prompts involved or how you were
> > interacting with Claude in order to get where you are now, but those
> > seem like the more obvious tools for work of this size whose intention
> > is to be open.
>
> Regarding the AI-generated code, the raw output from Claude was far
> from perfect. I have manually reviewed and modified the code
> extensively to get it to this state.
>
> Our plan is to first deploy and test this thoroughly on our own
> product, Alibaba Cloud RDS for PostgreSQL. Once we are confident that
> it is stable and issue-free, we intend to submit a formalized patch
> to the community. I am very much looking forward to discussing and
> reviewing the actual code with you all when the time comes.
>
> As for sharing the prompts or session logs, I personally feel they
> might not be as valuable as the final code itself. The generation
> process involved a lot of iterative, back-and-forth communication;
> the AI only knew how to make the right modifications after continuous
> human guidance, correction, and architectural decisions.
>
> > It would be helpful if you could provide a little more information on
> > the system you are running these benchmarks on, specifically for me
> > the underlying OS/Filesystem/hardware, and I'd even be interested in
> > the build flags. I'd also be interested to know if you did any kind of
> > crash safety testing... while it is great to have improved
> > performance, presumably that isn't actually the primary point of these
> > subsystems. It'd also be worth knowing if you tested this on any
> > systems with replication (physical or logical) since we'd need to
> > understand those potential downstream effects. I'm tempted to say you
> > should have an AI generate some pgbench scripts. Granted, it's early and
> > fine if you haven't done any of this, but I imagine we'll need to look at
> > it eventually.
>
> I have addressed the feedback and conducted comprehensive benchmarks
> comparing the three io_torn_pages_protection modes. Here are the
> detailed performance results and the system setup information you
> requested.
>
> Benchmark Setup:
> - Hardware: x86_64, Linux 5.10, NVMe SSD
> - PostgreSQL: 19devel (with DWB patch applied)
> - Tool: pgbench (TPC-B), 64 clients, 8 threads, 60 seconds per run
> - Common config: shared_buffers = 1GB, wal_level = replica
> - Three modes tested:
> * io_torn_pages_protection = full_pages (traditional FPW)
> * io_torn_pages_protection = double_writes (DWB size = 128MB)
> * io_torn_pages_protection = off (no protection, baseline)
>
> Each test was run sequentially on the same machine to avoid I/O
> contention.
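
For reference, the runs described above could presumably be reproduced with something like the following (flags inferred from the setup description; the database name and WAL segment names are placeholders):

```shell
# Initialize a scale-100 TPC-B dataset (~1.5 GB of table data)
pgbench -i -s 100 bench

# 64 clients, 8 threads, 60 seconds per run, as described above
pgbench -c 64 -j 8 -T 60 bench

# FPI count and size can then be totalled from the WAL produced by the run
pg_waldump --stats <first_segment> <last_segment>
```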
>
> Test 1: pgbench scale=100 (~1.5GB dataset), max_wal_size = 10GB
>
> With infrequent checkpoints, FPW overhead is minimal and all three
> modes perform similarly:
>
> > Mode           TPS      Latency(avg)  WAL Size  FPI Count  FPI Size
> > -------------  -------  ------------  --------  ---------  --------
> > full_pages     103,290  0.610 ms      3,903 MB  191,341    1,456 MB
> > double_writes  104,088  0.606 ms      2,475 MB  0          0
> > off            104,622  0.602 ms      2,510 MB  0          0
>
> DWB vs FPW: +0.8% TPS, WAL reduced by 36.6%.
>
> Test 2: pgbench scale=100 (~1.5GB dataset), max_wal_size = 64MB
>
> With frequent checkpoints (triggered every ~64MB of WAL), the FPW write
> amplification becomes severe:
>
> > Mode           TPS      Latency(avg)  WAL Size  FPI Count  FPI Size
> > -------------  -------  ------------  --------  ---------  --------
> > full_pages     54,324   1.171 ms      29 GB     3,806,504  28 GB
> > double_writes  93,942   0.672 ms      2,303 MB  2          16 kB
> > off            108,901  0.578 ms      2,746 MB  0          0
>
> DWB vs FPW: +72.9% TPS, latency reduced by 42.6%, WAL reduced by 92.2%.
>
> Test 3: pgbench scale=10 (~150MB dataset), max_wal_size = 64MB
>
> Even with a smaller dataset that fits in shared_buffers, the advantage
> is clear:
>
> > Mode           TPS      Latency(avg)  WAL Size  FPI Count  FPI Size
> > -------------  -------  ------------  --------  ---------  --------
> > full_pages     33,707   1.895 ms      15 GB     2,010,439  14 GB
> > double_writes  43,743   1.459 ms      1,140 MB  0          0
> > off            43,982   1.452 ms      1,150 MB  0          0
>
> DWB vs FPW: +29.8% TPS, WAL reduced by 92.6%.
>
> Test 4: sysbench oltp_write_only (10 tables x 100K rows), 64 threads, 30s
>
> >                max_wal_size=10GB    max_wal_size=64MB
> >                -------------------  -------------------
> > Mode           TPS      FPI Count   TPS      FPI Count
> > -------------  -------  ----------  -------  ----------
> > full_pages     138,019  32,253      40,187   3,455,253
> > double_writes  136,642  0           133,034  0
>
> With large max_wal_size: DWB and FPW perform equally (-1.0%).
> With small max_wal_size: DWB is +231% faster (3.3x).
>
> Analysis:
>
> The key factor is checkpoint frequency. FPW must write a full 8KB page
> image to WAL for every page's first modification after each checkpoint.
> When checkpoints are frequent:
>
> > - FPI count explodes (191K -> 3.8M with pgbench scale=100)
> - WAL becomes dominated by FPI data (28GB out of 29GB = 96.6%)
> - This creates massive write amplification on the WAL path
>
> DWB avoids this entirely. Its WAL size stays constant regardless of
> checkpoint frequency (~2.3GB vs FPW's 29GB). The double-write buffer
> itself wrote 42GB in the heavy test, but since it uses sequential writes
> with batched fsync, the overhead is modest — DWB achieves 86.3% of the
> "no protection" baseline (93,942 vs 108,901 TPS).
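
The headline percentages are consistent with the raw numbers; taking Test 2 as an example:

```shell
# WAL reduction: 29 GB (full_pages) vs 2,303 MB (double_writes)
awk 'BEGIN { printf "%.1f%%\n", (1 - 2303 / (29 * 1024)) * 100 }'   # 92.2%

# FPI share of the full_pages WAL: 28 GB of FPIs out of 29 GB total
awk 'BEGIN { printf "%.1f%%\n", 28 / 29 * 100 }'                    # 96.6%

# double_writes throughput as a fraction of the unprotected baseline
awk 'BEGIN { printf "%.1f%%\n", 93942 / 108901 * 100 }'             # 86.3%
```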
>
> When does DWB matter most?
> - Large active datasets that exceed shared_buffers
> - Frequent checkpoints (small max_wal_size or short checkpoint_timeout)
> - Write-heavy workloads
> - Replication scenarios where WAL volume directly impacts network
>
> In production environments where max_wal_size is often set
> conservatively (e.g., 1GB) and datasets are much larger than
> shared_buffers, DWB should provide significant and consistent benefits
> over FPW. As for the crash safety testing you mentioned, it is on our
> roadmap as we continue to refine the patch for our internal RDS
> deployment.
>
> Regards,
> Baotiao
>