If the dump was taken with pg_dump -Fd and pg_restore -j has no effect on restore time, that's a good clue. You can start by testing with checkpoints effectively deferred: set max_wal_size = 1TB and checkpoint_timeout = 10h and see how this affects the pg_restore (it should then be limited by WAL write throughput). Perhaps also increase wal_buffers to 128MB. The idea is to identify (by elimination) the write choke point before starting to tune for it.
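
For that test, the relevant postgresql.conf entries could look roughly like the following (the values are only illustrative and assume the WAL volume has enough free space; revert them once the restore is done). max_wal_size and checkpoint_timeout take effect on a reload, wal_buffers needs a restart:

    max_wal_size = 1TB          # large enough that size-triggered checkpoints don't fire mid-restore
    checkpoint_timeout = 10h    # keep time-triggered checkpoints out of the picture as well
    wal_buffers = 128MB         # more room to buffer WAL during bulk loading (restart required)
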
Unrelated to your problem: since you use SSDs, you should set the *_io_concurrency parameters to 200. pg_restore also rebuilds indexes, so make sure the settings relevant to index building are set appropriately (see max_parallel_* and *_io_concurrency; a sketch follows below the quoted message).

Kiriakos Georgiou

> On Dec 5, 2025, at 5:30 AM, MentionTheElephant <MentionTheElephant at gmail.com> wrote:
> 
> Hello,
> 
> I would greatly appreciate your insight into an issue where pg_restore
> runs significantly slower than expected, even though the underlying
> storage shows very high random write throughput. I am trying to
> understand which PostgreSQL mechanisms or system layers I should
> investigate next in order to pinpoint the bottleneck and improve
> restore performance.
> 
> The central question is: What should I examine further to understand
> why checkpoint processing becomes the dominant bottleneck during
> restore, despite fsync=off, synchronous_commit=off, and excellent
> random write latency?
> 
> Below is a detailed description of the environment, the behavior
> observed, the steps I have already taken, and the research performed
> so far.
> 
> During pg_restore, execution time remains extremely long: around 2+
> hours using a custom-format dump and over 4 hours using directory
> format. The machine consistently demonstrates high random write
> performance (median latency ~5 ms, ~45k random write IOPS), yet
> PostgreSQL logs show very long checkpoints where the write phase
> dominates (hundreds to thousands of seconds). Checkpoints appear to
> stall the entire restore process.
> 
> I have tested multiple combinations of dump formats (custom and
> directory) and parallel jobs (j = 1, 12, 18). The restore duration
> barely changes. This strongly suggests that the bottleneck is not
> client-side parallelism but internal server behavior—specifically the
> checkpoint write phase.
> 
> Example log excerpts show checkpoint write times consistently in the
> range of 600–1100 seconds, with large numbers of buffers written (from
> hundreds of thousands to over 1.6 million). Sync times remain
> negligible because fsync is disabled, reinforcing the suspicion that
> PostgreSQL's internal buffer flushing and write throttling mechanisms
> are the source of slowdown, not WAL or filesystem sync.
> 
> Given that:
> 
> * Storage is fast,
> * fsync and synchronous commits are disabled,
> * full_page_writes is off,
> * wal_level is minimal,
> * autovacuum is off,
> * the restore is the only workload,
> 
> I am trying to determine what further PostgreSQL internals or Linux
> I/O mechanisms may explain why these checkpoints are taking orders of
> magnitude longer than the device’s raw write characteristics would
> suggest.
> 
> I am particularly looking for guidance on:
> 
> * Whether backend or checkpointer write throttling may still be
> limiting write concurrency even during bulk restore,
> * Whether XFS on Hyper-V VHDX + LVM + battery-backed SSD could
> introduce any serialization invisible to raw I/O tests,
> * Whether certain parameters (e.g., effective_io_concurrency,
> maintenance_io_concurrency, wal_writer settings, combine limits,
> io_uring behavior) could unintentionally reduce write throughput,
> * Whether parallel pg_restore is inherently constrained by global
> buffer flushing behavior,
> * Any other PostgreSQL mechanisms that could cause prolonged
> checkpoint write durations even with crash-safety disabled.
> 
> Below are the configuration values and environment details referenced above.
> 
> Machine:
> Hyper-V VM
> 24 vCPU
> 80 GB RAM
> Ubuntu 24.04.3 (kernel 6.8.0-88)
> PostgreSQL 18.1
> 
> Database size:
> ~700 GB across two tablespaces on separate disks (freshly restored)
> 
> Storage layout:
> Each disk is its own VHDX
> LVM on battery-backed SSD array
> XFS for PGDATA
> Barriers disabled
> 
> Random write performance (steady state):
> Median latency: 5.1 ms
> IOPS: ~45.6k
> 
> Restore tests:
> pg_restore custom format: ~2h+
> pg_restore directory format: ~4h+
> Parallelism tested with j = 1, 12, 18, 24
> 
> Representative checkpoint log entries:
> (write phases ranging 76–1079 seconds, buffer writes up to 1.6M)
> 
> postgresql.conf (relevant parts):
> shared_buffers = 20GB
> work_mem = 150MB
> maintenance_work_mem = 8GB
> effective_io_concurrency = 1
> maintenance_io_concurrency = 1
> io_max_combine_limit = 512kB
> io_combine_limit = 1024kB
> io_method = io_uring
> 
> fsync = off
> synchronous_commit = off
> wal_sync_method = fdatasync
> full_page_writes = off
> wal_compression = lz4
> 
> checkpoint_timeout = 60min
> checkpoint_completion_target = 0.9
> max_wal_size = 80GB
> min_wal_size = 10GB
> 
> effective_cache_size = 65GB
> autovacuum = off
> max_locks_per_transaction = 256
> 
> If anyone has encountered similar behavior or can recommend specific
> PostgreSQL subsystems, kernel settings, or I/O patterns worth
> investigating, I would be very grateful for advice. My main goal is to
> understand why checkpoint writes are so slow relative to the
> hardware’s demonstrated capabilities, and how to safely accelerate the
> restore workflow.
> 
> Thank you in advance for any guidance.
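
As a concrete sketch of the index-related settings mentioned above (the worker counts are assumptions for a 24-vCPU box and need to be balanced against how many pg_restore jobs run at once, not values I know to be optimal for your workload):

    effective_io_concurrency = 200        # prefetch depth appropriate for SSDs
    maintenance_io_concurrency = 200      # same, for maintenance work such as index builds
    max_parallel_maintenance_workers = 4  # parallel workers each CREATE INDEX may use
    max_parallel_workers = 24             # overall cap; also bounded by max_worker_processes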
