tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL files appears to significantly degrade query latency on ZFS. The reason is complicated and I have details below. Has it been considered to make this behavior tunable, to cause PostgreSQL to always create new WAL files instead of re-using old ones?
Context: we're running a large fleet of PostgreSQL shards. Each shard consists of a primary, a synchronous standby, and an asynchronous standby using chained replication. For this problem, we can consider only the primary and synchronous standby. PostgreSQL: 9.6.3 OS: illumos (SmartOS, mixed versions, but all from 2017 or later) FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match PostgreSQL's record size. We have an SSD log device for completing synchronous writes very quickly. WAL files are 16MB each, and we're keeping a very large number of segments. (There's likely a lot of improvement for WAL tuning here.) Since we're using an 8K record size, when PostgreSQL writes small (or non-aligned) records to the WAL files, ZFS has to read the old contents in order to write the new 8K record (read-modify-write). If that's in cache, that's not a big deal. But when PostgreSQL decides to reuse an old WAL file whose contents have been evicted from the cache (because they haven't been used in hours), this turns what should be a workload bottlenecked by synchronous write performance (that's well-optimized with our SSD log device) into a random read workload (that's much more expensive for any system). What's significantly worse is that we saw this on synchronous standbys. When that happened, the WAL receiver was blocked on a random read from disk, and since it's single-threaded, all write queries on the primary stop until the random read finishes. This is particularly bad for us when the sync is doing other I/O (e.g., for an autovacuum or a database backup) that causes disk reads to take hundreds of milliseconds. Reusing old WAL files seems like an optimization intended for filesystems that allocate disk blocks up front. With copy-on-write, it doesn't seem to make much sense. If instead of using an old WAL file, PostgreSQL instead just created a new one, there would be no random reads required to complete these operations, and we believe we'd completely eliminate our latency outliers. Thanks, Dave