Hi hackers,
While reading the code related to streaming reads and their current
use cases, I noticed that pgstatindex could benefit from adopting the
streaming read API. The required change is relatively simple, similar
to what has already been done in the pg_prewarm and pg_visibility
modules. I also ran some performance tests on an experimental patch to
validate the improvement.
Summary
Cold cache performance (the typical use case for diagnostic tools):
- Medium indexes (~21MB): 1.21x-1.79x faster (20-44% speedup)
- Large indexes (~214MB): 1.50x-1.90x faster (30-47% speedup)
- XLarge indexes (~1351MB): 1.4x-1.9x faster (29-47% speedup)
Hardware: Hetzner AX162-R
Test matrix:
- Index types: Primary key, timestamp, float, composite (3 columns)
- Sizes: Medium (1M rows, ~21MB), Large (10M rows, ~214MB), XLarge
(50M rows, ~1351MB)
- Layouts: Unfragmented (sequential) and Fragmented (random insert order)
- Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)
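The two cache states can be sketched as follows. This is a minimal
illustration, not the actual test script: the index name 't_pkey' is
hypothetical, and for the cold state I assume the usual approach of
restarting the server after dropping the OS page cache (e.g. echo 3 >
/proc/sys/vm/drop_caches as root), which also empties shared_buffers.

```sql
-- Warm state: load the index into shared_buffers before timing.
-- 't_pkey' is a hypothetical index name.
SELECT pg_prewarm('t_pkey');

-- Then time the call under test:
\timing on
SELECT * FROM pgstatindex('t_pkey');
```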
XLarge fragmented example:
==> Creating secondary indexes on test_xlarge
Created 3 secondary indexes: created_at, score, composite
Created test_xlarge_pkey: 1351 MB
Fragmentation stats (random insert order):
leaf_frag_pct | avg_density_pct | leaf_pages | size
---------------+-----------------+------------+---------
49.9 | 71.5 | 172272 | 1351 MB
(1 row)
Configuration:
- shared_buffers = 16GB
- effective_io_concurrency = 500
- io_combine_limit = 16
- autovacuum = off
- checkpoint_timeout = 1h
- bgwriter_delay = 10000ms (minimize background writes)
- jit = off
- max_parallel_workers_per_gather = 0
Unfragmented Indexes (Cold Cache)
Index Type Size Baseline Patched Speedup
Primary Key Medium 31.5 ms 19.6 ms 1.58×
Primary Key Large 184.0 ms 119.0 ms 1.54×
Timestamp Medium 13.4 ms 10.5 ms 1.28×
Timestamp Large 85.0 ms 45.6 ms 1.86×
Float (score) Medium 13.7 ms 11.4 ms 1.21×
Float (score) Large 84.0 ms 45.0 ms 1.86×
Composite (3 col) Medium 26.7 ms 17.2 ms 1.56×
Composite (3 col) Large 89.8 ms 51.2 ms 1.75×
⸻
Fragmented Indexes (Cold Cache)
To address concerns about filesystem fragmentation, I tested indexes built
with random inserts (ORDER BY random()) to trigger page splits and create
fragmented indexes:
Index Type Size Baseline Patched Speedup
Primary Key Medium 41.9 ms 23.5 ms 1.79×
Primary Key Large 236.0 ms 148.0 ms 1.58×
Primary Key XLarge 953.4 ms 663.1 ms 1.43×
Timestamp Medium 32.1 ms 18.8 ms 1.70×
Timestamp Large 188.0 ms 117.0 ms 1.59×
Timestamp XLarge 493.0 ms 518.6 ms 0.95×
Float (score) Medium 14.0 ms 10.9 ms 1.28×
Float (score) Large 85.8 ms 45.2 ms 1.89×
Float (score) XLarge 263.2 ms 176.5 ms 1.49×
Composite (3 col) Medium 42.3 ms 24.1 ms 1.75×
Composite (3 col) Large 245.0 ms 162.0 ms 1.51×
Composite (3 col) XLarge 1052.5 ms 716.5 ms 1.46×
Summary: Fragmentation generally does not hurt streaming reads; most
fragmented cases still see 1.4×–1.9× gains. One outlier (XLarge
Timestamp) shows a slight regression (0.95×).
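The fragmented-index setup described above can be sketched as below.
This is only an illustration of the technique; the table and column
names are hypothetical, not taken from the actual test scripts.

```sql
-- Create the table and secondary index first, so that subsequent
-- inserts go through the existing index structures.
CREATE TABLE t (id bigint PRIMARY KEY, created_at timestamptz);
CREATE INDEX ON t (created_at);

-- Inserting in random key order forces page splits in the existing
-- indexes, producing out-of-order (fragmented) leaf pages.
INSERT INTO t
SELECT i, now() + make_interval(secs => i)
FROM generate_series(1, 10000000) AS i
ORDER BY random();

-- Check the resulting fragmentation:
SELECT leaf_fragmentation, avg_leaf_density
FROM pgstatindex('t_pkey');
```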
⸻
Warm Cache Results
When indexes are fully cached in shared_buffers:
Unfragmented: occasional small regressions for small-to-medium indexes
(single-digit ms variance, barely noticeable); small gains for large
indexes.
Fragmented: same pattern; occasional small regressions for
small-to-medium indexes and small gains for large indexes.
Best,
Xuneng
From 153ab2799803dc402789d0aa825456ea12f2d3d9 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Sun, 12 Oct 2025 21:27:22 +0800
Subject: [PATCH] pgstattuple: Use streaming read API in pgstatindex functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 205 ++++++++++++++++++------------
1 file changed, 121 insertions(+), 84 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..a708cc417b0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -273,60 +274,77 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
- for (blkno = 1; blkno < nblocks; blkno++)
{
- Buffer buffer;
- Page page;
- BTPageOpaque opaque;
-
- CHECK_FOR_INTERRUPTS();
-
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
-
- page = BufferGetPage(buffer);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Determine page type, and update totals.
- *
- * Note that we arbitrarily bucket deleted pages together without
- * considering if they're leaf pages or internal pages.
- */
- if (P_ISDELETED(opaque))
- indexStat.deleted_pages++;
- else if (P_IGNORE(opaque))
- indexStat.empty_pages++; /* this is the "half dead" state */
- else if (P_ISLEAF(opaque))
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- int max_avail;
+ Buffer buffer;
+ Page page;
+ BTPageOpaque opaque;
- max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
- indexStat.max_avail += max_avail;
- indexStat.free_space += PageGetExactFreeSpace(page);
+ CHECK_FOR_INTERRUPTS();
- indexStat.leaf_pages++;
+ buffer = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ opaque = BTPageGetOpaque(page);
/*
- * If the next leaf is on an earlier block, it means a
- * fragmentation.
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
*/
- if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
- indexStat.fragments++;
- }
- else
- indexStat.internal_pages++;
+ if (P_ISDELETED(opaque))
+ indexStat.deleted_pages++;
+ else if (P_IGNORE(opaque))
+ indexStat.empty_pages++; /* this is the "half dead" state */
+ else if (P_ISLEAF(opaque))
+ {
+ int max_avail;
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
+ indexStat.max_avail += max_avail;
+ indexStat.free_space += PageGetExactFreeSpace(page);
+
+ indexStat.leaf_pages++;
+
+ /*
+ * If the next leaf is on an earlier block, it means a
+ * fragmentation.
+ */
+ if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+ indexStat.fragments++;
+ }
+ else
+ indexStat.internal_pages++;
+
+ UnlockReleaseBuffer(buffer);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
relation_close(rel, AccessShareLock);
/*----------------------------
@@ -636,60 +654,79 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
- for (blkno = 1; blkno < nblocks; blkno++)
+ /* Scan all blocks except the metapage using streaming reads */
{
- Buffer buf;
- Page page;
-
- CHECK_FOR_INTERRUPTS();
-
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
- stats.unused_pages++;
- else if (PageGetSpecialSize(page) !=
- MAXALIGN(sizeof(HashPageOpaqueData)))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("index \"%s\" contains corrupted page at block %u",
- RelationGetRelationName(rel),
- BufferGetBlockNumber(buf))));
- else
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- HashPageOpaque opaque;
- int pagetype;
+ Buffer buf;
+ Page page;
- opaque = HashPageGetOpaque(page);
- pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+ CHECK_FOR_INTERRUPTS();
- if (pagetype == LH_BUCKET_PAGE)
- {
- stats.bucket_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_OVERFLOW_PAGE)
- {
- stats.overflow_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_BITMAP_PAGE)
- stats.bitmap_pages++;
- else if (pagetype == LH_UNUSED_PAGE)
+ buf = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
stats.unused_pages++;
- else
+ else if (PageGetSpecialSize(page) !=
+ MAXALIGN(sizeof(HashPageOpaqueData)))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
- opaque->hasho_flag, RelationGetRelationName(rel),
+ errmsg("index \"%s\" contains corrupted page at block %u",
+ RelationGetRelationName(rel),
BufferGetBlockNumber(buf))));
- }
+ else
+ {
+ HashPageOpaque opaque;
+ int pagetype;
+
+ opaque = HashPageGetOpaque(page);
+ pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+
+ if (pagetype == LH_BUCKET_PAGE)
+ {
+ stats.bucket_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_OVERFLOW_PAGE)
+ {
+ stats.overflow_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_BITMAP_PAGE)
+ stats.bitmap_pages++;
+ else if (pagetype == LH_UNUSED_PAGE)
+ stats.unused_pages++;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
+ opaque->hasho_flag, RelationGetRelationName(rel),
+ BufferGetBlockNumber(buf))));
+ }
UnlockReleaseBuffer(buf);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
/* Done accessing the index */
index_close(rel, AccessShareLock);
--
2.51.0