Hello,

I would greatly appreciate your insight into an issue where pg_restore
runs significantly slower than expected, even though the underlying
storage shows very high random write throughput. I am trying to
understand which PostgreSQL mechanisms or system layers I should
investigate next in order to pinpoint the bottleneck and improve
restore performance.

The central question is: What should I examine further to understand
why checkpoint processing becomes the dominant bottleneck during
restore, despite fsync=off, synchronous_commit=off, and excellent
random write latency?

Below is a detailed description of the environment, the behavior
observed, the steps I have already taken, and the research performed
so far.

During pg_restore, total execution time remains extremely long: over
2 hours with a custom-format dump and over 4 hours with a
directory-format dump. The machine consistently demonstrates high
random write
performance (median latency ~5 ms, ~45k random write IOPS), yet
PostgreSQL logs show very long checkpoints where the write phase
dominates (hundreds to thousands of seconds). Checkpoints appear to
stall the entire restore process.
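
For completeness, the same picture should also be visible in the
cumulative checkpointer statistics, and I intend to capture something
like the following during the next run (write_time and sync_time are
reported in milliseconds):

SELECT num_timed,
       num_requested,
       buffers_written,
       round(write_time / 1000) AS write_s,
       round(sync_time / 1000)  AS sync_s
FROM pg_stat_checkpointer;

I can include that output in a follow-up if it would help.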

I have tested multiple combinations of dump formats (custom and
directory) and parallel jobs (-j 1, 12, 18, and 24). The restore duration
barely changes. This strongly suggests that the bottleneck is not
client-side parallelism but internal server behavior—specifically the
checkpoint write phase.
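
For reference, the invocations were along these lines (database name
and dump paths here are placeholders rather than the real ones):

pg_restore -j 18 -d targetdb /backups/dump.custom
pg_restore -j 18 -d targetdb /backups/dumpdir

Only the -j value and the dump format changed between runs.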

Example log excerpts show checkpoint write times frequently in the
range of 600–1100 seconds, with large numbers of buffers written (from
hundreds of thousands to over 1.6 million). Sync times remain
negligible because fsync is disabled, reinforcing the suspicion that
PostgreSQL's internal buffer flushing and write throttling mechanisms
are the source of the slowdown, not WAL or filesystem sync.
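
Because of that suspicion, my next step is to double-check the
flush/throttling-related settings on this server with a query along
these lines (the GUC list here may well be incomplete):

SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('checkpoint_flush_after',
               'bgwriter_flush_after',
               'backend_flush_after',
               'bgwriter_lru_maxpages',
               'checkpoint_completion_target');

Are any of these known to throttle the checkpointer noticeably in a
bulk-load scenario like this one?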

Given that:

* Storage is fast,
* fsync and synchronous commits are disabled,
* full_page_writes is off,
* wal_level is minimal,
* autovacuum is off,
* the restore is the only workload,

I am trying to determine what further PostgreSQL internals or Linux
I/O mechanisms may explain why these checkpoints are taking orders of
magnitude longer than the device’s raw write characteristics would
suggest.
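
To separate checkpointer writes from backend and bgwriter writes, I
also plan to sample pg_stat_io during a repeat run, roughly like this
(as far as I understand, the *_time columns are only populated when
track_io_timing is on):

SELECT backend_type, context,
       writes, write_time,
       writebacks, writeback_time,
       fsyncs, fsync_time
FROM pg_stat_io
WHERE writes > 0
ORDER BY writes DESC;

I can share that output as well if it would be useful.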

I am particularly looking for guidance on:

* Whether backend or checkpointer write throttling may still be
limiting write concurrency even during bulk restore,
* Whether XFS on Hyper-V VHDX + LVM + battery-backed SSD could
introduce any serialization invisible to raw I/O tests,
* Whether certain parameters (e.g., effective_io_concurrency,
maintenance_io_concurrency, wal_writer settings, the combine limits,
io_uring behavior) could unintentionally reduce write throughput (see
the query after this list),
* Whether parallel pg_restore is inherently constrained by global
buffer flushing behavior,
* Any other PostgreSQL mechanisms that could cause prolonged
checkpoint write durations even with crash-safety disabled.
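
On the combine limits specifically: my understanding is that
io_combine_limit is silently capped by io_max_combine_limit, so the
effective value may be lower than what I configured below. I plan to
confirm the runtime values with something like:

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('io_method',
               'io_combine_limit',
               'io_max_combine_limit',
               'effective_io_concurrency',
               'maintenance_io_concurrency');

Please correct me if that capping behavior is not how it works in 18.1.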

Below are the configuration values and environment details referenced above.

Machine:
Hyper-V VM
24 vCPU
80 GB RAM
Ubuntu 24.04.3 (kernel 6.8.0-88)
PostgreSQL 18.1

Database size:
~700 GB across two tablespaces on separate disks (freshly restored)

Storage layout:
Each disk is its own VHDX
LVM on battery-backed SSD array
XFS for PGDATA
Barriers disabled

Random write performance (steady state):
Median latency: 5.1 ms
IOPS: ~45.6k

Restore tests:
pg_restore, custom format: over 2 hours
pg_restore, directory format: over 4 hours
Parallelism tested with j = 1, 12, 18, 24

Representative checkpoint log entries:
(write phases ranging 76–1079 seconds, buffer writes up to 1.6M)

postgresql.conf (relevant parts):
shared_buffers = 20GB
work_mem = 150MB
maintenance_work_mem = 8GB
effective_io_concurrency = 1
maintenance_io_concurrency = 1
io_max_combine_limit = 512kB
io_combine_limit = 1024kB
io_method = io_uring

fsync = off
synchronous_commit = off
wal_sync_method = fdatasync
full_page_writes = off
wal_compression = lz4

checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
max_wal_size = 80GB
min_wal_size = 10GB

effective_cache_size = 65GB
autovacuum = off
max_locks_per_transaction = 256
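
Before the next test run I am also planning to turn on a bit more
instrumentation, so that the pg_stat_io timing columns are populated
and checkpoint activity keeps being logged. Roughly (both settings
should be changeable with a reload, as far as I know):

ALTER SYSTEM SET track_io_timing = on;
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();

Please tell me if other counters or settings would be more
informative here.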

If anyone has encountered similar behavior or can recommend specific
PostgreSQL subsystems, kernel settings, or I/O patterns worth
investigating, I would be very grateful for advice. My main goal is to
understand why checkpoint writes are so slow relative to the
hardware’s demonstrated capabilities, and how to safely accelerate the
restore workflow.

Thank you in advance for any guidance.

