On Tuesday 2025-10-21 00:15, Tom Lane wrote:
> So for me, the proposed patch actually makes it 2X slower.
> I went and tried this same test case on a 2024 Mac Mini M4 Pro.
> Cutting to the chase:
> HEAD:
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
> real 1m26.525s
> user 0m0.364s
> sys 0m6.806s
> Patched:
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
> real 0m15.419s
> user 0m0.279s
> sys 0m8.224s
> So on this hardware it *does* win (although maybe things would
> be different for a parallel restore). The patched pg_restore
> takes just about the same amount of time as "cat", and iostat
> shows both of them reaching a bit more than 6GB/s read speed.
> My feeling at this point is that we'd probably drop the block
> size test as irrelevant, and instead simply ignore ctx->hasSeek
> within this loop if we think we're on a platform where that's
> the right thing. But how do we figure that out?
> Not sure where we go from here, but clearly a bunch of research
> is going to be needed to decide whether this is committable.
pg_dump files generated before your latest fix still exist, and they may
contain a block header every 30 bytes (or however wide the table rows
are). A patch in pg_restore would vastly improve this use case.
May I suggest the attached patch, which replaces fseeko() with fread()
when the distance is 32KB or less? It seems rather improbable that this
would make things worse, but perhaps it is possible to generate a dump
file with 32KB-wide rows and try restoring it on various hardware?
If this too is controversial, we can reduce the threshold to 4KB, which
is the buffer size glibc uses internally. By using the same value in the
patch, we avoid all the redundant lseek(same-offset) calls between the
4KB reads. This should be a strict gain, with no downsides.
Dimitris
From 4676b4001598c101f452762fd212b903803e47ca Mon Sep 17 00:00:00 2001
From: Dimitrios Apostolou <[email protected]>
Date: Sat, 29 Mar 2025 01:16:07 +0100
Subject: [PATCH v5] parallel pg_restore: avoid disk seeks when moving short
distance forward
Improve the performance of parallel pg_restore (-j) from a custom-format
pg_dump archive that does not include data offsets, which is typically
the case when pg_dump wrote the archive to stdout instead of a file.
Also speeds up restoration of specific tables (-t tablename).
In these cases, before the actual data restoration starts, pg_restore
workers exhibit a constant loop of small reads (4KB) and short forward
seeks (around 10KB for a compressed archive, or even only a few bytes
for an uncompressed one):
read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496
This happens because each worker has to scan the whole file until it
finds the entry it wants, skipping forward one block at a time. Combined
with the small block size of the custom-format dump, this causes many
seeks and low performance.
Fix by replacing forward seeks of 32KB or less with sequential reads.
Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.
---
src/bin/pg_dump/pg_backup_custom.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index f7c3af56304..087b1c81e7f 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -623,19 +623,23 @@ _skipData(ArchiveHandle *AH)
{
lclContext *ctx = (lclContext *) AH->formatData;
size_t blkLen;
char *buf = NULL;
int buflen = 0;
blkLen = ReadInt(AH);
while (blkLen != 0)
{
- if (ctx->hasSeek)
+ /*
+ * Sequential access is usually faster, so avoid seeking if the jump
+ * forward is 32KB or less.
+ */
+ if (ctx->hasSeek && blkLen > 32 * 1024)
{
if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
pg_fatal("error during file seek: %m");
}
else
{
if (blkLen > buflen)
{
free(buf);
--
2.51.0