On Thursday 2025-10-16 19:01, Tom Lane wrote:
>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done. Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?
Sorry for the delay, I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.
I notice that things have improved immensely!
Using the test you suggested (see NOTE1):
pg_restore -t last_table -f /dev/null huge.pg_dump
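(In case it helps reproduce: the syscall patterns below come from
tracing that command with something along the lines of

strace -e trace=read,lseek pg_restore -t last_table -f /dev/null huge.pg_dump

and adding -c to strace prints per-syscall counts instead of the live
trace.)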
1. The strace output is much more reasonable now; basically it's
repeating the pattern
read(4k)
lseek(~128k forward)
As a reminder, with old archives it was repeating the pattern:
read(4k)
lseek(4k forward)
lseek(same offset as above) x ~80 times
2. The I/O speed is better than before:
On my 20TB HDD I get a 30-50 MB/s read rate.
With old archives I get only 10-20 MB/s.
3. Time to complete: ~25 min
4. CPU usage is low. With old archives the pg_restore process shows
high *system* CPU (because of the sheer number of syscalls).
I can't really compare the actual runtimes of the old and new dumps,
because the two dumps are very different. But I have no doubt the new
dump is several times faster to seek through.
NOTE1: My original testcase was
pg_restore -t last_table -j $NCPU -d testdb
This testcase does not show as big an improvement, because every
single one of the parallel workers is concurrently seeking through the
dump file; with -j8, for example, the archive would be scanned eight
times in parallel.
*** All of the above was measured on master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
Don't rely on zlib's gzgetc() macro.
*** Below, I have applied the attached patch ***
Regarding the attached patch (rebased, with an edited commit message):
it basically replaces seek(up to 1MB forward) with read(). The 1MB
number comes off the top of my head, but tweaking it anywhere between
128KB and 1MB wouldn't really change anything, given that the block
size is now 128KB: the read() will always be chosen over the seek().
Do you know of a real-world case with block sizes >128KB?
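To make that concrete, the decision in isolation looks roughly like
the following standalone sketch (my own illustration, not code from
the patch; the constant and function names are made up):

#include <stdbool.h>
#include <stddef.h>

/* 1MB threshold, same value as in the attached patch */
#define SKIP_SEEK_THRESHOLD (1024 * 1024)

/*
 * Seek only when the archive is seekable AND the jump is long enough
 * to beat a sequential read.  With 128KB archive blocks, blkLen never
 * exceeds any threshold >= 128KB, so read() is always chosen.
 */
static bool
skip_should_seek(bool hasSeek, size_t blkLen)
{
	return hasSeek && blkLen > SKIP_SEEK_THRESHOLD;
}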
Anyway I tried it with the new archive from above.
1. strace output is a loop of the following:
read(4k)
read(~128k)
2. Read rate is between 150-250 MB/s, basically the max that the HDD
can give.
3. Time to complete: ~5 min
4. CPU usage: HIGH (63%), most likely because of the sheer amount
of data it's parsing.
Regards,
Dimitris
From dcbbe92ba16b0c1dfe6320960bc2882b15850de6 Mon Sep 17 00:00:00 2001
From: Dimitrios Apostolou <[email protected]>
Date: Sat, 29 Mar 2025 01:16:07 +0100
Subject: [PATCH v4] parallel pg_restore: avoid disk seeks when moving a short
 distance forward
Improve the performance of parallel pg_restore (-j) from a custom-format
pg_dump archive that does not include data offsets - typically the case
when pg_dump has generated it by writing to stdout instead of to a file.
Also speeds up restoration of specific tables (-t tablename).
In these cases, before the actual data restoration starts, the
pg_restore workers exhibit a constant loop of small reads (4KB) and
short forward seeks (around 10KB for a compressed archive, or even only
a few bytes for uncompressed ones):
read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496
This happens because each worker has to scan the whole file until it
finds the entry it wants, skipping forward one block at a time. In
combination with the small block size of the custom format dump, this
causes many seeks and low performance.
Fix by avoiding forward seeks for jumps of less than 1MB, doing
sequential reads instead.
The performance gain can be significant, depending on the size of the
dump and the I/O subsystem. On my local NVMe drive, read speeds for
that phase of pg_restore increased from 150MB/s to 3GB/s.
---
src/bin/pg_dump/pg_backup_custom.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index f7c3af56304..27695e24dde 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -623,19 +623,23 @@ _skipData(ArchiveHandle *AH)
 {
 	lclContext *ctx = (lclContext *) AH->formatData;
 	size_t		blkLen;
 	char	   *buf = NULL;
 	int			buflen = 0;
 
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (ctx->hasSeek)
+		/*
+		 * Sequential access is usually faster, so avoid seeking if the jump
+		 * forward is shorter than 1MB.
+		 */
+		if (ctx->hasSeek && blkLen > 1024 * 1024)
 		{
 			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
 				pg_fatal("error during file seek: %m");
 		}
 		else
 		{
 			if (blkLen > buflen)
 			{
 				free(buf);
--
2.51.0