Hi hackers,
While reading the code related to streaming reads and their current
use cases, I noticed that pgstatindex could benefit from adopting the
streaming read API. The required change is relatively simple, similar
to what has already been done in the pg_prewarm and pg_visibility
modules. I also ran some performance tests on an experimental patch to
validate the improvement.
Summary
Cold cache performance (the typical use case for diagnostic tools):
- Medium indexes (~21MB): 1.21x-1.79x faster (20-44% speedup)
- Large indexes (~214MB): 1.50x-1.90x faster (30-47% speedup)
- XLarge indexes (~1351MB): 1.4x-1.9x faster (29-47% speedup)
Hardware: Hetzner AX162-R
Test matrix:
- Index types: Primary key, timestamp, float, composite (3 columns)
- Sizes: Medium (1M rows, ~21MB), Large (10M rows, ~214MB), XLarge
(50M rows, ~1351MB)
- Layouts: Unfragmented (sequential) and Fragmented (random insert order)
- Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)
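The two cache states can be sketched as follows. This is a minimal
illustration, not the actual test script: the index name 't_pkey' is
hypothetical, and for the cold state I assume the usual approach of
restarting the server after dropping the OS page cache (e.g. echo 3 >
/proc/sys/vm/drop_caches as root), which also empties shared_buffers.

```sql
-- Warm state: load the index into shared_buffers before timing.
-- 't_pkey' is a hypothetical index name.
SELECT pg_prewarm('t_pkey');

-- Then time the call under test:
\timing on
SELECT * FROM pgstatindex('t_pkey');
```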
XLarge fragmented example:
==> Creating secondary indexes on test_xlarge
Created 3 secondary indexes: created_at, score, composite
Created test_xlarge_pkey: 1351 MB
Fragmentation stats (random insert order):
leaf_frag_pct | avg_density_pct | leaf_pages | size
---------------+-----------------+------------+---------
49.9 | 71.5 | 172272 | 1351 MB
(1 row)
Configuration:
- shared_buffers = 16GB
- effective_io_concurrency = 500
- io_combine_limit = 16
- autovacuum = off
- checkpoint_timeout = 1h
- bgwriter_delay = 10000ms (minimize background writes)
- jit = off
- max_parallel_workers_per_gather = 0
Unfragmented Indexes (Cold Cache)
Index Type Size Baseline Patched Speedup
Primary Key Medium 31.5 ms 19.6 ms 1.58×
Primary Key Large 184.0 ms 119.0 ms 1.54×
Timestamp Medium 13.4 ms 10.5 ms 1.28×
Timestamp Large 85.0 ms 45.6 ms 1.86×
Float (score) Medium 13.7 ms 11.4 ms 1.21×
Float (score) Large 84.0 ms 45.0 ms 1.86×
Composite (3 col) Medium 26.7 ms 17.2 ms 1.56×
Composite (3 col) Large 89.8 ms 51.2 ms 1.75×
⸻
Fragmented Indexes (Cold Cache)
To address concerns about filesystem fragmentation, I tested indexes built
with random inserts (ORDER BY random()) to trigger page splits and create
fragmented indexes:
Index Type Size Baseline Patched Speedup
Primary Key Medium 41.9 ms 23.5 ms 1.79×
Primary Key Large 236.0 ms 148.0 ms 1.58×
Primary Key XLarge 953.4 ms 663.1 ms 1.43×
Timestamp Medium 32.1 ms 18.8 ms 1.70×
Timestamp Large 188.0 ms 117.0 ms 1.59×
Timestamp XLarge 493.0 ms 518.6 ms 0.95×
Float (score) Medium 14.0 ms 10.9 ms 1.28×
Float (score) Large 85.8 ms 45.2 ms 1.89×
Float (score) XLarge 263.2 ms 176.5 ms 1.49×
Composite (3 col) Medium 42.3 ms 24.1 ms 1.75×
Composite (3 col) Large 245.0 ms 162.0 ms 1.51×
Composite (3 col) XLarge 1052.5 ms 716.5 ms 1.46×
Summary: Fragmentation generally does not hurt streaming reads; most
fragmented cases still see 1.4×–1.9× gains. One outlier (XLarge
Timestamp) shows a slight regression (0.95×).
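The fragmented-index setup described above can be sketched as below.
This is only an illustration of the technique; the table and column
names are hypothetical, not taken from the actual test scripts.

```sql
-- Create the table and secondary index first, so that subsequent
-- inserts go through the existing index structures.
CREATE TABLE t (id bigint PRIMARY KEY, created_at timestamptz);
CREATE INDEX ON t (created_at);

-- Inserting in random key order forces page splits in the existing
-- indexes, producing out-of-order (fragmented) leaf pages.
INSERT INTO t
SELECT i, now() + make_interval(secs => i)
FROM generate_series(1, 10000000) AS i
ORDER BY random();

-- Check the resulting fragmentation:
SELECT leaf_fragmentation, avg_leaf_density
FROM pgstatindex('t_pkey');
```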
⸻
Warm Cache Results
When indexes are fully cached in shared_buffers:
Unfragmented: occasional small regressions for small-to-medium indexes
(single-digit ms variance, barely noticeable); small gains for large
indexes.
Fragmented: same pattern; occasional small regressions for
small-to-medium indexes and small gains for large indexes.
Best,
Xuneng
From 153ab2799803dc402789d0aa825456ea12f2d3d9 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Sun, 12 Oct 2025 21:27:22 +0800
Subject: [PATCH] pgstattuple: Use streaming read API in pgstatindex functions
Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().
---
contrib/pgstattuple/pgstatindex.c | 205 ++++++++++++++++++------------
1 file changed, 121 insertions(+), 84 deletions(-)
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 40823d54fca..a708cc417b0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
+#include "storage/read_stream.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -273,60 +274,77 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
indexStat.fragments = 0;
/*
- * Scan all blocks except the metapage
+ * Scan all blocks except the metapage using streaming reads
*/
nblocks = RelationGetNumberOfBlocks(rel);
- for (blkno = 1; blkno < nblocks; blkno++)
{
- Buffer buffer;
- Page page;
- BTPageOpaque opaque;
-
- CHECK_FOR_INTERRUPTS();
-
- /* Read and lock buffer */
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
-
- page = BufferGetPage(buffer);
- opaque = BTPageGetOpaque(page);
-
- /*
- * Determine page type, and update totals.
- *
- * Note that we arbitrarily bucket deleted pages together without
- * considering if they're leaf pages or internal pages.
- */
- if (P_ISDELETED(opaque))
- indexStat.deleted_pages++;
- else if (P_IGNORE(opaque))
- indexStat.empty_pages++; /* this is the "half dead" state */
- else if (P_ISLEAF(opaque))
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- int max_avail;
+ Buffer buffer;
+ Page page;
+ BTPageOpaque opaque;
- max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
- indexStat.max_avail += max_avail;
- indexStat.free_space += PageGetExactFreeSpace(page);
+ CHECK_FOR_INTERRUPTS();
- indexStat.leaf_pages++;
+ buffer = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ page = BufferGetPage(buffer);
+ opaque = BTPageGetOpaque(page);
/*
- * If the next leaf is on an earlier block, it means a
- * fragmentation.
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
*/
- if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
- indexStat.fragments++;
- }
- else
- indexStat.internal_pages++;
+ if (P_ISDELETED(opaque))
+ indexStat.deleted_pages++;
+ else if (P_IGNORE(opaque))
+ indexStat.empty_pages++; /* this is the "half dead" state */
+ else if (P_ISLEAF(opaque))
+ {
+ int max_avail;
- /* Unlock and release buffer */
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
+ max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
+ indexStat.max_avail += max_avail;
+ indexStat.free_space += PageGetExactFreeSpace(page);
+
+ indexStat.leaf_pages++;
+
+ /*
+ * If the next leaf is on an earlier block, it means a
+ * fragmentation.
+ */
+ if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+ indexStat.fragments++;
+ }
+ else
+ indexStat.internal_pages++;
+
+ UnlockReleaseBuffer(buffer);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
relation_close(rel, AccessShareLock);
/*----------------------------
@@ -636,60 +654,79 @@ pgstathashindex(PG_FUNCTION_ARGS)
/* prepare access strategy for this index */
bstrategy = GetAccessStrategy(BAS_BULKREAD);
- /* Start from blkno 1 as 0th block is metapage */
- for (blkno = 1; blkno < nblocks; blkno++)
+ /* Scan all blocks except the metapage using streaming reads */
{
- Buffer buf;
- Page page;
-
- CHECK_FOR_INTERRUPTS();
-
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- bstrategy);
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
- stats.unused_pages++;
- else if (PageGetSpecialSize(page) !=
- MAXALIGN(sizeof(HashPageOpaqueData)))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("index \"%s\" contains corrupted page at block %u",
- RelationGetRelationName(rel),
- BufferGetBlockNumber(buf))));
- else
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream;
+
+ p.current_blocknum = 1;
+ p.last_exclusive = nblocks;
+
+ stream = read_stream_begin_relation(READ_STREAM_FULL |
+ READ_STREAM_USE_BATCHING,
+ bstrategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
+ for (blkno = 1; blkno < nblocks; blkno++)
{
- HashPageOpaque opaque;
- int pagetype;
+ Buffer buf;
+ Page page;
- opaque = HashPageGetOpaque(page);
- pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+ CHECK_FOR_INTERRUPTS();
- if (pagetype == LH_BUCKET_PAGE)
- {
- stats.bucket_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_OVERFLOW_PAGE)
- {
- stats.overflow_pages++;
- GetHashPageStats(page, &stats);
- }
- else if (pagetype == LH_BITMAP_PAGE)
- stats.bitmap_pages++;
- else if (pagetype == LH_UNUSED_PAGE)
+ buf = read_stream_next_buffer(stream, NULL);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
stats.unused_pages++;
- else
+ else if (PageGetSpecialSize(page) !=
+ MAXALIGN(sizeof(HashPageOpaqueData)))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
- opaque->hasho_flag, RelationGetRelationName(rel),
+ errmsg("index \"%s\" contains corrupted page at block %u",
+ RelationGetRelationName(rel),
BufferGetBlockNumber(buf))));
- }
+ else
+ {
+ HashPageOpaque opaque;
+ int pagetype;
+
+ opaque = HashPageGetOpaque(page);
+ pagetype = opaque->hasho_flag & LH_PAGE_TYPE;
+
+ if (pagetype == LH_BUCKET_PAGE)
+ {
+ stats.bucket_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_OVERFLOW_PAGE)
+ {
+ stats.overflow_pages++;
+ GetHashPageStats(page, &stats);
+ }
+ else if (pagetype == LH_BITMAP_PAGE)
+ stats.bitmap_pages++;
+ else if (pagetype == LH_UNUSED_PAGE)
+ stats.unused_pages++;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("unexpected page type 0x%04X in HASH index \"%s\" block %u",
+ opaque->hasho_flag, RelationGetRelationName(rel),
+ BufferGetBlockNumber(buf))));
+ }
UnlockReleaseBuffer(buf);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+}
+
/* Done accessing the index */
index_close(rel, AccessShareLock);
--
2.51.0