Thomas,

We're using a ZFS recordsize of 8k to match the PG blocksize of 8k, so
what you're describing is not the issue here.
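In case it's useful for anyone else following along, this is easy to
check and change with the zfs CLI (the dataset name below is just a
placeholder for wherever the WAL directory lives):

    # zfs get recordsize tank/postgres
    # zfs set recordsize=8k tank/postgres

One caveat: changing recordsize only affects files created afterwards,
so any existing WAL files keep the block size they were written with.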
Thanks,
Jerry

On Thu, Jul 5, 2018 at 3:44 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Fri, Jul 6, 2018 at 3:37 AM, Jerry Jelinek <jerry.jeli...@joyent.com>
> wrote:
> >> If the problem is specifically the file system caching behavior, then we
> >> could also consider using the dreaded posix_fadvise().
> >
> > I'm not sure that solves the problem for non-cached files, which is where
> > we've observed the performance impact of recycling, where what should be
> > a write-intensive workload turns into a read-modify-write workload
> > because we're now reading an old WAL file that is many hours, or even
> > days, old and has thus fallen out of the memory-cached data for the
> > filesystem. The disk reads still have to happen.
>
> What ZFS record size are you using? PostgreSQL's XLOG_BLCKSZ is usually
> 8192 bytes. When XLogWrite() calls write(some multiple of XLOG_BLCKSZ), on
> a traditional filesystem the kernel will say 'oh, that's overwriting whole
> pages exactly, so I have no need to read it from disk' (for example, in
> FreeBSD ffs_vnops.c ffs_write() see the comment "We must perform a
> read-before-write if the transfer size does not cover the entire buffer").
> I assume ZFS has a similar optimisation, but it uses much larger records
> than the traditional 4096-byte pages, defaulting to 128KB. Is that the
> reason for this?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
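To make the read-modify-write mechanics concrete, here is a minimal
standalone sketch of the I/O pattern in question. It is not PostgreSQL's
actual XLogWrite() code, and the filename and sizes are placeholders: it
just overwrites an existing segment-sized file in aligned 8KB chunks.
With recordsize=8k, each pwrite() covers a whole record and no
read-before-write is needed; with the 128KB default, each write dirties
only 1/16th of a record, so the old record must be read back in first
whenever it isn't cached.

    /*
     * Sketch of the overwrite pattern under discussion: rewrite an
     * existing (recycled) segment in aligned 8KB chunks.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ  8192                /* stand-in for XLOG_BLCKSZ */
    #define SEGSZ   (16 * 1024 * 1024)  /* stand-in WAL segment size */

    int
    main(void)
    {
        char   *buf;
        int     fd;
        off_t   off;

        /* assumption: ./walfile is a pre-existing SEGSZ-byte file */
        fd = open("walfile", O_WRONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        buf = malloc(BLCKSZ);
        if (buf == NULL)
            return 1;
        memset(buf, 'x', BLCKSZ);

        /* overwrite the whole segment, one aligned 8KB block at a time */
        for (off = 0; off < SEGSZ; off += BLCKSZ)
        {
            if (pwrite(fd, buf, BLCKSZ, off) != BLCKSZ)
            {
                perror("pwrite");
                return 1;
            }
        }

        free(buf);
        close(fd);
        return 0;
    }

Running something like this against files on datasets with recordsize=8k
and recordsize=128k, with a cold ARC, should show the extra reads in the
128k case.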