On Tuesday 2025-10-21 00:23, Tom Lane wrote:
> HEAD repeats
>     read(4k)
>     lseek(~128k forward)
> which is to be expected if we have to read data block headers
> that are ~128K apart; while patched repeats
>     read(4k)
>     read(~128k)
> which is a bit odd in itself, why isn't it merging the reads better?
The read(4k) happens because of the getc() calls that read the next
block's length.
As noticed in a message above [1], glibc seems to do 4KB buffering by
default, for some reason. setvbuf() can mitigate this.
[1] https://www.postgresql.org/message-id/1po8os49-r63o-2923-p37n-12698o1qn7p0%40tzk.arg
I'm attaching a patch that sets glibc buffering to 1MB just as a proof
of concept. It's obviously WIP, it allocates and never frees. :-)
Feel free to pick it up and change it as you see fit.
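For whoever picks this up, here is a minimal sketch of one way to tie the
buffer to the stream's lifetime so that it gets freed. It is standalone C, not
pg_dump code: BufferedFile, buffered_open() and buffered_close() are made-up
names, and it uses plain malloc() instead of pg_malloc(). It relies on two
rules from the C standard: setvbuf() has to be the first operation on the
stream after fopen(), and the buffer may only be freed once the stream is
closed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define STDIO_BUF_SIZE (1024 * 1024)	/* same 1MB as in the PoC patch */

/* Hypothetical wrapper: a stream together with the buffer it owns. */
typedef struct BufferedFile
{
	FILE	   *fp;
	char	   *buf;
} BufferedFile;

/* Open "path" and give the stream a 1MB buffer. Returns 0 or -1. */
static int
buffered_open(BufferedFile *bf, const char *path, const char *mode)
{
	assert(bf != NULL && path != NULL && mode != NULL);

	bf->buf = NULL;
	bf->fp = fopen(path, mode);
	if (bf->fp == NULL)
		return -1;

	/* setvbuf() must come before any read, write or seek on the stream */
	bf->buf = malloc(STDIO_BUF_SIZE);
	if (bf->buf == NULL ||
		setvbuf(bf->fp, bf->buf, _IOFBF, STDIO_BUF_SIZE) != 0)
	{
		fclose(bf->fp);
		free(bf->buf);
		bf->fp = NULL;
		bf->buf = NULL;
		return -1;
	}
	return 0;
}

/* Close the stream first, then release the buffer it was using. */
static int
buffered_close(BufferedFile *bf)
{
	int			ret = fclose(bf->fp);

	free(bf->buf);
	bf->fp = NULL;
	bf->buf = NULL;
	return ret;
}
```

In pg_restore the buffer pointer would presumably have to live somewhere like
the format's private context, but I have not checked where it fits best.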
With this patch, the read() calls are merged in strace. The lseek() calls
remain, even if they are not actually reading anything.
It seems to me that glibc could implement an optimisation for fseeko():
store the current position in the file, and do not issue the lseek()
system call if the position does not change.
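The same idea can be sketched on the application side, without touching
glibc. This is only an illustration under assumptions: TrackedFile and the
tracked_*() functions are made-up names, the stream is assumed to be freshly
opened (offset 0), and every read and seek is assumed to go through the
wrapper so that the remembered position stays accurate.

```c
#define _POSIX_C_SOURCE 200809L	/* for fseeko() and off_t */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrapper: a stream plus our own record of its offset. */
typedef struct TrackedFile
{
	FILE	   *fp;
	off_t		pos;		/* offset of the next byte to be read */
	long		seeks;		/* fseeko() calls actually issued */
} TrackedFile;

static void
tracked_init(TrackedFile *tf, FILE *fp)
{
	assert(tf != NULL && fp != NULL);
	tf->fp = fp;
	tf->pos = 0;			/* assumption: stream is at offset 0 */
	tf->seeks = 0;
}

/* Read through the wrapper so that the remembered offset stays correct. */
static size_t
tracked_read(TrackedFile *tf, void *buf, size_t len)
{
	size_t		n = fread(buf, 1, len, tf->fp);

	tf->pos += (off_t) n;
	return n;
}

/* Seek to an absolute offset, skipping fseeko() if we are already there. */
static int
tracked_seek(TrackedFile *tf, off_t target)
{
	if (target == tf->pos)
		return 0;			/* nothing to do, no library call at all */
	if (fseeko(tf->fp, target, SEEK_SET) != 0)
		return -1;
	tf->pos = target;
	tf->seeks++;
	return 0;
}
```

Whether this saves anything in practice depends on how often the caller asks
to seek to the offset it is already at; I have not measured that.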
>> I was using an HDD,
> Ah. Your original message mentioned NVMe so I was assuming you
> were also looking at solid-state drives. I can imagine that
> seeking is more painful on HDDs ...
Sorry for the confusion; over all this time I've run tests on too many
different hardware combinations. Not the best way to draw conclusions,
but it's what I had available each time.
Dimitris
From 56559e95fabbc498d7127db45992336f10a6cd93 Mon Sep 17 00:00:00 2001
From: Dimitrios Apostolou <[email protected]>
Date: Tue, 21 Oct 2025 15:55:47 +0200
Subject: [PATCH v1] WIP: increase glibc buffering for pg_restore custom format
---
src/bin/pg_dump/pg_backup_custom.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index f7c3af56304..79ffac315b5 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -160,18 +160,22 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
ctx->hasSeek = checkSeek(AH->FH);
}
else
{
if (AH->fSpec && strcmp(AH->fSpec, "") != 0)
{
AH->FH = fopen(AH->fSpec, PG_BINARY_R);
if (!AH->FH)
pg_fatal("could not open input file \"%s\": %m", AH->fSpec);
+ void *buf = pg_malloc(1024*1024);
+ int ret = setvbuf(AH->FH, buf, _IOFBF, 1024*1024);
+ if (ret != 0)
+ pg_fatal("setvbuf failed: %m");
}
else
{
AH->FH = stdin;
if (!AH->FH)
pg_fatal("could not open input file: %m");
}
ctx->hasSeek = checkSeek(AH->FH);
@@ -808,18 +812,23 @@ _ReopenArchive(ArchiveHandle *AH)
pg_fatal("could not close archive file: %m");
#endif
AH->FH = fopen(AH->fSpec, PG_BINARY_R);
if (!AH->FH)
pg_fatal("could not open input file \"%s\": %m", AH->fSpec);
+ void *buf = pg_malloc(1024*1024);
+ int ret = setvbuf(AH->FH, buf, _IOFBF, 1024*1024);
+ if (ret != 0)
+ pg_fatal("setvbuf failed: %m");
+
if (fseeko(AH->FH, tpos, SEEK_SET) != 0)
pg_fatal("could not set seek position in archive file: %m");
}
/*
* Prepare for parallel restore.
*
* The main thing that needs to happen here is to fill in TABLE DATA and BLOBS
* TOC entries' dataLength fields with appropriate values to guide the
* ordering of restore jobs. The source of said data is format-dependent,
* as is the exact meaning of the values.
--
2.51.0