On Tuesday 2025-10-21 00:15, Tom Lane wrote:
> So for me, the proposed patch actually makes it 2X slower.
> I went and tried this same test case on a 2024 Mac Mini M4 Pro.
> Cutting to the chase:
> HEAD:
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
> real 1m26.525s
> user 0m0.364s
> sys 0m6.806s
> Patched:
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
> real 0m15.419s
> user 0m0.279s
> sys 0m8.224s
> So on this hardware it *does* win (although maybe things would
> be different for a parallel restore). The patched pg_restore
> takes just about the same amount of time as "cat", and iostat
> shows both of them reaching a bit more than 6GB/s read speed.
> My feeling at this point is that we'd probably drop the block
> size test as irrelevant, and instead simply ignore ctx->hasSeek
> within this loop if we think we're on a platform where that's
> the right thing. But how do we figure that out?
> Not sure where we go from here, but clearly a bunch of research
> is going to be needed to decide whether this is committable.
pg_dump files generated before your latest fix still exist, and they may
contain a block header every 30 bytes (or however wide the table rows
are). A patch in pg_restore would vastly improve this use case.
May I suggest the attached patch, which replaces fseeko() with fread()
when the distance is 32KB or less? It seems rather improbable that this
would make things worse, but perhaps it is possible to generate a dump
file with 32KB-wide rows and try restoring it on various hardware?
If this too is controversial, we can reduce the threshold to 4KB, which
is the buffer size glibc uses internally. By using the same value in the
patch, we avoid all the redundant lseek(same-offset) calls between the
4KB reads. This should be a strict gain, with no downsides.
Dimitris
From 4676b4001598c101f452762fd212b903803e47ca Mon Sep 17 00:00:00 2001
From: Dimitrios Apostolou <[email protected]>
Date: Sat, 29 Mar 2025 01:16:07 +0100
Subject: [PATCH v5] parallel pg_restore: avoid disk seeks when moving short
distance forward
Improve the performance of parallel pg_restore (-j) from a custom-format
pg_dump archive that does not include data offsets, which is typically
the case when pg_dump wrote the archive to stdout instead of a file.
Also speeds up restoration of specific tables (-t tablename).
In these cases, before the actual data restoration starts, pg_restore
workers exhibit a constant loop of small reads (4KB) and short forward
seeks (around 10KB for a compressed archive, or even only a few bytes
for an uncompressed one):
read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496
This happens because each worker has to scan the whole file until it
finds the entry it wants, skipping forward one block at a time. Combined
with the small block size of the custom-format dump, this causes many
seeks and low performance.
Fix by replacing forward seeks of 32KB or less with sequential reads.
Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.
---
src/bin/pg_dump/pg_backup_custom.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index f7c3af56304..087b1c81e7f 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -623,19 +623,23 @@ _skipData(ArchiveHandle *AH)
{
lclContext *ctx = (lclContext *) AH->formatData;
size_t blkLen;
char *buf = NULL;
int buflen = 0;
blkLen = ReadInt(AH);
while (blkLen != 0)
{
- if (ctx->hasSeek)
+ /*
+ * Sequential access is usually faster, so avoid seeking if the jump
+ * forward is 32KB or less.
+ */
+ if (ctx->hasSeek && blkLen > 32 * 1024)
{
if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
pg_fatal("error during file seek: %m");
}
else
{
if (blkLen > buflen)
{
free(buf);
--
2.51.0