Hi,

On Mon, Feb 9, 2026 at 6:40 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <[email protected]> 
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <[email protected]> 
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > > >
> > > > > > > Two more to go:
> > > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > > >
> > > > > > > Benchmarks will be conducted soon.
> > > > > > >
> > > > > >
> > > > > > The v6 attached to the last message had a problem. Attaching the
> > > > > > corrected version again. Sorry for the noise.
> > > > >
> > > > > 0003 and 0006:
> > > > >
> > > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > > >
> > > > Done.
> > > >
> > > > > 0005:
> > > > >
> > > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > >          nbufs = 0;
> > > > >          while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > >          {
> > > > > -            Buffer        buf = ReadBufferExtended(rel, forknum, blkno,
> > > > > -                                                 RBM_NORMAL, NULL);
> > > > > +            Buffer        buf = read_stream_next_buffer(stream, NULL);
> > > > > +
> > > > > +            if (!BufferIsValid(buf))
> > > > > +                break;
> > > > >
> > > > > We are loosening a check here; there should not be an invalid buffer in
> > > > > the stream before endblk. I think you can remove this
> > > > > BufferIsValid() check, so we can learn if something goes wrong.
> > > >
> > > > My earlier reason for not adding an assert at the end of the stream
> > > > was the potential early break here:
> > > >
> > > > /* Nothing more to do if all remaining blocks were empty. */
> > > > if (nbufs == 0)
> > > >     break;
> > > >
> > > > After looking more closely, that turned out to be a misunderstanding
> > > > of the logic.
> > > >
> > > > > 0006:
> > > > >
> > > > > You can use read_stream_reset() instead of read_stream_end(); then you
> > > > > can reuse the same stream with different variables. I believe this is
> > > > > the preferred way.
> > > > >
> > > > > Rest LGTM!
> > > > >
> > > >
> > > > Yeah, reset seems the more appropriate approach here.
> > > >
> > >
> > > I've run pgindent using the updated typedefs.list.
> > >
> >
> > I've completed benchmarking of the v4 streaming read patches across
> > three I/O methods (io_uring, sync, worker). Tests were run with cold
> > cache on large datasets.
> >
> > --- Settings ---
> >
> > shared_buffers = '8GB'
> > effective_io_concurrency = 200
> > io_method = $IO_METHOD
> > io_workers = $IO_WORKERS
> > io_max_concurrency = $IO_MAX_CONCURRENCY
> > track_io_timing = on
> > autovacuum = off
> > checkpoint_timeout = 1h
> > max_wal_size = 10GB
> > max_parallel_workers_per_gather = 0
> >
> > --- Machine ---
> > CPU: 48-core
> > RAM: 256 GB DDR5
> > Disk: 2 x 1.92 TB NVMe SSD
> >
> > --- Executive Summary ---
> >
> > The patches provide significant benefits for I/O-bound sequential
> > operations, with the greatest improvements seen when using
> > asynchronous I/O methods (io_uring and worker). The synchronous I/O
> > mode shows reduced but still meaningful gains.
> >
> > --- Results by I/O Method ---
> >
> > Best Results: io_method=worker
> >
> > bloom_scan: 4.14x (75.9% faster); 93% fewer reads
> > pgstattuple: 1.59x (37.1% faster); 94% fewer reads
> > hash_vacuum: 1.05x (4.4% faster); 80% fewer reads
> > gin_vacuum: 1.06x (5.6% faster); 15% fewer reads
> > bloom_vacuum: 1.04x (3.9% faster); 76% fewer reads
> > wal_logging: 0.98x (-2.5%, neutral/slightly slower); no change in reads
> >
> > io_method=io_uring
> >
> > bloom_scan: 3.12x (68.0% faster); 93% fewer reads
> > pgstattuple: 1.50x (33.2% faster); 94% fewer reads
> > hash_vacuum: 1.03x (3.3% faster); 80% fewer reads
> > gin_vacuum: 1.02x (2.1% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (3.4% faster); 76% fewer reads
> > wal_logging: 1.00x (-0.5%, neutral); no change in reads
> >
> > io_method=sync (baseline comparison)
> >
> > bloom_scan: 1.20x (16.4% faster); 93% fewer reads
> > pgstattuple: 1.10x (9.0% faster); 94% fewer reads
> > hash_vacuum: 1.01x (0.8% faster); 80% fewer reads
> > gin_vacuum: 1.02x (1.7% faster); 15% fewer reads
> > bloom_vacuum: 1.03x (2.8% faster); 76% fewer reads
> > wal_logging: 0.99x (-0.7%, neutral); no change in reads
> >
> > --- Observations ---
> >
> > Async I/O amplifies streaming benefits: the same patches show up to
> > 3-4x improvement (bloom_scan) with worker/io_uring vs 1.2x with sync.
> >
> > I/O operation reduction is consistent: All modes show the same ~93-94%
> > reduction in I/O operations for bloom_scan and pgstattuple.
> >
> > VACUUM operations show modest gains: Despite large I/O reductions
> > (76-80%), wall-clock improvements are smaller (3-15%) since VACUUM has
> > larger CPU overhead (tuple processing, index maintenance, WAL
> > logging).
> >
> > log_newpage_range shows no benefit: the patch provides no measurable
> > improvement (0.98-1.00x across I/O methods).
> >
> > --
> > Best,
> > Xuneng
>
> There was an issue with the wal_log test in the original script.
>
> --- The original benchmark used:
> ALTER TABLE ... SET LOGGED
>
> This path performs a full table rewrite via ATRewriteTable()
> (tablecmds.c). It creates a new relfilenode and copies tuples into it.
> It does not call log_newpage_range() on rewritten pages.
>
> log_newpage_range() may only appear indirectly through the
> pending-sync logic in storage.c, and only when:
>
> wal_level = minimal, and
> relation size < wal_skip_threshold (default 2MB).
>
> Our test tables (1M–20M rows) are far larger than 2MB. In that case,
> PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
> previous benchmark measured table rewrite I/O, not the
> log_newpage_range() path.
>
> --- Current design: GIN index build
>
> The benchmark now uses:
> CREATE INDEX ... USING gin (doc_tsv)
>
> This reliably exercises log_newpage_range() because:
> - ginbuild() constructs the index and WAL-logs all new index pages
> using log_newpage_range().
> - This is part of the normal GIN build path, independent of 
> wal_skip_threshold.
> - The streaming-read patch modifies the WAL logging path inside
> log_newpage_range(), which this test directly targets.
>
> --- Results (wal_logging_large)
> worker: 1.00x (+0.5%); no meaningful change in reads
> io_uring: 1.01x (+1.3%); no meaningful change in reads
> sync: 1.01x (+1.1%); no meaningful change in reads
>
> --
> Best,
> Xuneng

Here’s v5 of the patchset. The wal_logging_large patch has been
removed, as no performance gains were observed in the benchmark runs.

-- 
Best,
Xuneng

Attachment: v5-0001-Switch-Bloom-scan-paths-to-streaming-read.patch
Description: Binary data

Attachment: v5-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch
Description: Binary data

Attachment: v5-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch
Description: Binary data

Attachment: v5-0002-Streamify-Bloom-VACUUM-paths.patch
Description: Binary data

Attachment: v5-0005-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
Description: Binary data
