Hi,

On Mon, Feb 9, 2026 at 6:40 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Thu, Feb 5, 2026 at 12:01 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, Dec 30, 2025 at 10:43 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Dec 30, 2025 at 9:51 AM Xuneng Zhou <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > On Mon, Dec 29, 2025 at 6:58 PM Nazir Bilal Yavuz <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Sun, 28 Dec 2025 at 14:46, Xuneng Zhou <[email protected]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > Two more to go:
> > > > > > > patch 5: Streamify log_newpage_range() WAL logging path
> > > > > > > patch 6: Streamify hash index VACUUM primary bucket page reads
> > > > > > >
> > > > > > > Benchmarks will be conducted soon.
> > > > > >
> > > > > > v6 in the last message has a problem and has not been updated.
> > > > > > Attaching the right one again. Sorry for the noise.
> > > > >
> > > > > 0003 and 0006:
> > > > >
> > > > > You need to add 'StatApproxReadStreamPrivate' and
> > > > > 'HashBulkDeleteStreamPrivate' to the typedefs.list.
> > > >
> > > > Done.
> > > >
> > > > > 0005:
> > > > >
> > > > > @@ -1321,8 +1341,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
> > > > >                 nbufs = 0;
> > > > >                 while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
> > > > >                 {
> > > > > -                       Buffer      buf = ReadBufferExtended(rel, forknum, blkno,
> > > > > -                                                            RBM_NORMAL, NULL);
> > > > > +                       Buffer      buf = read_stream_next_buffer(stream, NULL);
> > > > > +
> > > > > +                       if (!BufferIsValid(buf))
> > > > > +                               break;
> > > > >
> > > > > We are loosening a check here; there should not be an invalid buffer in
> > > > > the stream until endblk. I think you can remove this BufferIsValid()
> > > > > check, then we can learn if something goes wrong.
> > > >
> > > > My earlier concern about not adding an assert at the end of the stream
> > > > was the potential early break here:
> > > >
> > > >     /* Nothing more to do if all remaining blocks were empty. */
> > > >     if (nbufs == 0)
> > > >         break;
> > > >
> > > > After looking more closely, that turned out to be a misunderstanding of
> > > > the logic.
> > > >
> > > > > 0006:
> > > > >
> > > > > You can use read_stream_reset() instead of read_stream_end(); then you
> > > > > can use the same stream with different variables. I believe this is
> > > > > the preferred way.
> > > > >
> > > > > Rest LGTM!
> > > >
> > > > Yeah, reset seems the more proper way here.
> > >
> > > Ran pgindent using the updated typedefs.list.
> >
> > I've completed benchmarking of the v4 streaming read patches across
> > three I/O methods (io_uring, sync, worker). Tests were run with cold
> > cache on large datasets.
> >
> > --- Settings ---
> >
> > shared_buffers = '8GB'
> > effective_io_concurrency = 200
> > io_method = $IO_METHOD
> > io_workers = $IO_WORKERS
> > io_max_concurrency = $IO_MAX_CONCURRENCY
> > track_io_timing = on
> > autovacuum = off
> > checkpoint_timeout = 1h
> > max_wal_size = 10GB
> > max_parallel_workers_per_gather = 0
> >
> > --- Machine ---
> >
> > CPU: 48-core
> > RAM: 256 GB DDR5
> > Disk: 2 x 1.92 TB NVMe SSD
> >
> > --- Executive Summary ---
> >
> > The patches provide significant benefits for I/O-bound sequential
> > operations, with the greatest improvements seen when using
> > asynchronous I/O methods (io_uring and worker). The synchronous I/O
> > mode shows reduced but still meaningful gains.
> >
> > --- Results by I/O Method ---
> >
> > Best results: io_method=worker
> >
> > bloom_scan:   4.14x (75.9% faster); 93% fewer reads
> > pgstattuple:  1.59x (37.1% faster); 94% fewer reads
> > hash_vacuum:  1.05x (4.4% faster);  80% fewer reads
> > gin_vacuum:   1.06x (5.6% faster);  15% fewer reads
> > bloom_vacuum: 1.04x (3.9% faster);  76% fewer reads
> > wal_logging:  0.98x (-2.5%, neutral/slightly slower); no change in reads
> >
> > io_method=io_uring
> >
> > bloom_scan:   3.12x (68.0% faster); 93% fewer reads
> > pgstattuple:  1.50x (33.2% faster); 94% fewer reads
> > hash_vacuum:  1.03x (3.3% faster);  80% fewer reads
> > gin_vacuum:   1.02x (2.1% faster);  15% fewer reads
> > bloom_vacuum: 1.03x (3.4% faster);  76% fewer reads
> > wal_logging:  1.00x (-0.5%, neutral); no change in reads
> >
> > io_method=sync (baseline comparison)
> >
> > bloom_scan:   1.20x (16.4% faster); 93% fewer reads
> > pgstattuple:  1.10x (9.0% faster);  94% fewer reads
> > hash_vacuum:  1.01x (0.8% faster);  80% fewer reads
> > gin_vacuum:   1.02x (1.7% faster);  15% fewer reads
> > bloom_vacuum: 1.03x (2.8% faster);  76% fewer reads
> > wal_logging:  0.99x (-0.7%, neutral); no change in reads
> >
> > --- Observations ---
> >
> > Async I/O amplifies streaming benefits: the same patches show 3-4x
> > improvements with worker/io_uring vs. 1.2x with sync.
> >
> > I/O operation reduction is consistent: all modes show the same ~93-94%
> > reduction in I/O operations for bloom_scan and pgstattuple.
> >
> > VACUUM operations show modest gains: despite large I/O reductions
> > (76-80%), wall-clock improvements are smaller (3-15%), since VACUUM has
> > larger CPU overhead (tuple processing, index maintenance, WAL logging).
> >
> > log_newpage_range shows no benefit: the patch provides no improvement
> > (~0.97x).
> >
> > --
> > Best,
> > Xuneng
>
> There was an issue in the wal_log test of the original script.
>
> --- What the original benchmark measured ---
>
> The original benchmark used:
>
>     ALTER TABLE ... SET LOGGED
>
> This path performs a full table rewrite via ATRewriteTable()
> (tablecmds.c). It creates a new relfilenode and copies tuples into it.
> It does not call log_newpage_range() on the rewritten pages.
>
> log_newpage_range() may only appear indirectly through the
> pending-sync logic in storage.c, and only when:
>
> - wal_level = minimal, and
> - relation size < wal_skip_threshold (default 2MB).
>
> Our test tables (1M-20M rows) are far larger than 2MB. In that case,
> PostgreSQL fsyncs the file instead of WAL-logging it. Therefore, the
> previous benchmark measured table-rewrite I/O, not the
> log_newpage_range() path.
>
> --- Current design: GIN index build ---
>
> The benchmark now uses:
>
>     CREATE INDEX ... USING gin (doc_tsv)
>
> This reliably exercises log_newpage_range() because:
>
> - ginbuild() constructs the index and WAL-logs all new index pages
>   using log_newpage_range().
> - This is part of the normal GIN build path, independent of
>   wal_skip_threshold.
> - The streaming-read patch modifies the WAL logging path inside
>   log_newpage_range(), which this test directly targets.
>
> --- Results (wal_logging_large) ---
>
> worker:   1.00x (+0.5%); no meaningful change in reads
> io_uring: 1.01x (+1.3%); no meaningful change in reads
> sync:     1.01x (+1.1%); no meaningful change in reads
>
> --
> Best,
> Xuneng
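Two quick illustrations for anyone skimming the thread. All of these
conversions share the same basic shape: a ReadStream over a block range
replaces a loop of synchronous ReadBufferExtended() calls, so the I/O can
be combined and issued ahead of use. A minimal sketch of that shape, using
the generic block_range_read_stream_cb helper from read_stream.h (the
actual patches carry their own callback structs such as
StatApproxReadStreamPrivate, so the function and variable names below are
illustrative only, not code from the patchset):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

/* Illustrative sketch: visit every page in [startblk, endblk). */
static void
scan_block_range(Relation rel, BlockNumber startblk, BlockNumber endblk)
{
    BlockRangeReadStreamPrivate p;
    ReadStream *stream;
    Buffer      buf;

    /* The callback hands out block numbers from this range. */
    p.current_blocknum = startblk;
    p.last_exclusive = endblk;
    stream = read_stream_begin_relation(READ_STREAM_FULL,
                                        NULL,   /* default buffer strategy */
                                        rel,
                                        MAIN_FORKNUM,
                                        block_range_read_stream_cb,
                                        &p,
                                        0);

    /* Buffers come back pinned, in block order, with I/O started ahead. */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... examine BufferGetPage(buf) here ... */
        ReleaseBuffer(buf);
    }

    read_stream_end(stream);
}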
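And the reuse pattern Bilal suggested for 0006: rather than ending and
rebuilding a stream per bucket, reset it between ranges and end it once at
the very end. Again a sketch under the same assumptions (nranges and
ranges[] stand in for the per-bucket bookkeeping that the real patch keeps
in HashBulkDeleteStreamPrivate):

    /* One stream, reused across several disjoint block ranges. */
    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                        bstrategy,
                                        rel,
                                        MAIN_FORKNUM,
                                        block_range_read_stream_cb,
                                        &p,
                                        0);

    for (int i = 0; i < nranges; i++)
    {
        /* Re-aim the callback at the next range. */
        p.current_blocknum = ranges[i].start;
        p.last_exclusive = ranges[i].end;

        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
        {
            /* ... process one page ... */
            ReleaseBuffer(buf);
        }

        /* Clear the end-of-stream state so the stream can be pulled again. */
        read_stream_reset(stream);
    }

    read_stream_end(stream);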
Here's v5 of the patchset. The wal_logging_large patch has been removed,
as no performance gains were observed in the benchmark runs.

--
Best,
Xuneng
v5-0001-Switch-Bloom-scan-paths-to-streaming-read.patch
Description: Binary data
v5-0002-Streamify-Bloom-VACUUM-paths.patch
Description: Binary data
v5-0003-Streamify-heap-bloat-estimation-scan.-Introduce-a.patch
Description: Binary data
v5-0004-Replace-synchronous-ReadBufferExtended-loop-with-.patch
Description: Binary data
v5-0005-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch
Description: Binary data
