On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> ...  I
> suppose when you read a page in, you could tell the kernel that you
> POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
> tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
> advance somehow), on the theory that the coldest stuff in the PG cache
> should now become the hottest stuff in the OS cache.  Of course that
> sucks, because the best the kernel can do then is go and read it from
> disk, and the goal is to avoid IO.  Given a hypothetical way to
> "write" "clean" data to the kernel (so it wouldn't mark it dirty and
> generate IO, but it would let you read it back without generating IO
> if you're lucky), then perhaps you could actually achieve exclusive
> caching at the two levels, and use all your physical RAM without
> duplication.

Craig said essentially the same thing, on the nearby fsync() reliability thread:

On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> ... I'd kind of hoped to go in
> the other direction if anything, with some kind of pseudo-write op
> that let us swap a dirty shared_buffers entry from our shared_buffers
> into the OS dirty buffer cache (on Linux at least) and let it handle
> writeback, so we reduce double-buffering. Ha! So much for that!

I would like to reply to that here on this thread, which discusses
double buffering and performance, to avoid distracting the fsync()
thread from its main topic of reliability.

I think that idea has potential.  Even though I believe that direct IO
is generally the right way to go (that's been RDBMS orthodoxy for a
decade or more, AFAIK), we'll always want to support buffered IO (as
other RDBMSs do).  For one thing, not every filesystem supports direct
IO; ZFS is one example.  I love ZFS, and its caching is not simply a
dumb extension to shared_buffers that you have to go through syscalls
to reach: it has state-of-the-art page reclamation, cached data can be
LZ4-compressed, and there is an optional second-level cache which can
live on fast storage.

Perhaps if you patched PostgreSQL to tell the OS that you won't need
pages you've just read, and that you will need pages you've just
evicted, you might be able to straighten out some of that U shape by
getting more exclusive caching at the two levels.  Queued writes would
still be double-buffered of course, at least until they complete.
Telling the OS to prefetch something that you already have a copy of
is annoying and expensive, though.
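
To make that concrete, here's a rough sketch of what those hints might
look like with plain posix_fadvise(), assuming a system where it's
actually wired up for the filesystem in question.  The function names
and the fd/block bookkeeping are made up for illustration; a real
buffer manager would track those itself.

    #define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
    #include <fcntl.h>

    #define BLCKSZ 8192               /* PostgreSQL block size */

    /* After copying a block into a shared buffer, tell the kernel we
     * no longer need its copy of that range of the file. */
    static void
    hint_dontneed(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_DONTNEED);
    }

    /* When evicting a clean block, ask the kernel to pull it back into
     * the page cache, on the theory that PG-cold data should now be
     * OS-hot.  The snag: if the page isn't already cached, this
     * triggers a read, which is exactly the IO we hoped to avoid. */
    static void
    hint_willneed(int fd, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }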

The pie-in-the-sky version of this idea would let you "swap" pages
with the kernel, as you put it, though I was thinking of clean pages,
not dirty ones.  Then there'd be a non-overlapping set of pages from
your select-only pgbench in each cache.  Maybe that would look like
punread(fd, buf, size, offset) (!), or maybe write(fd, buf, size)
followed by fadvise(fd, offset, size,
FADV_I_PERSONALLY_GUARANTEE_THIS_DATA_IS_CLEAN_AND_I_CONSIDERED_CONCURRENCY_VERY_CAREFULLY),
or maybe pswap(read params... , unread params ...) to read new buffer
and unread old buffer at the same time.</crackpot-vapourware-OS>
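
For what it's worth, the imaginary interface might be used something
like this.  Entirely vapourware, as the tag says: pswap() exists in no
kernel, and every name here is just for illustration.

    /* Hypothetical syscall: hand the kernel a clean copy of the block
     * we're evicting (which it may cache without marking dirty or
     * generating write IO) and read the block we want in one trip. */
    extern int pswap(int fd,
                     void *read_buf, off_t read_offset,           /* block we want */
                     const void *unread_buf, off_t unread_offset, /* clean victim */
                     size_t len);

    /* Buffer replacement could then get exclusive caching: the victim
     * lives on in the OS cache, the new block lives in ours. */
    static int
    swap_with_kernel(int fd, void *new_buf, off_t new_off,
                     const void *victim_buf, off_t victim_off)
    {
        return pswap(fd, new_buf, new_off, victim_buf, victim_off, BLCKSZ);
    }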

Sadly, even if the simple non-pie-in-the-sky version of the above were
to work out and be beneficial on your favourite non-COW filesystem (on
which you might as well use direct IO and larger shared_buffers, some
day), it may currently be futile on ZFS because I think the fadvise
machinery might not even be hooked up (Solaris didn't believe in
fadvise on any filesystem, IIRC).  I'm not sure; I hope I'm wrong
about that.

-- 
Thomas Munro
http://www.enterprisedb.com
