On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
> On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> > a file into a user provided buffer, thus obtaining a page cache entry
> > and a copy in their userspace buffer, then insert the page of the user
> > buffer back into the page cache as the page cache page ... that's right,
> > isn't it, Postgres people?
> Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
> isn't needed anymore when reading. And we'd normally write if the page
> is dirty.

So why, exactly, do you even need the kernel page cache here? You've
got direct access to the copy of data read into userspace, and you
want direct control of when and how the data in that buffer is
written and reclaimed. Why push that data buffer back into the
kernel and then have to add all sorts of kernel interfaces to
control the page you already have control of?

> > Effectively you end up with buffered read/write that's also mapped into
> > the page cache.  It's a pretty awful way to hack around mmap.
> Well, the problem is that you can't really use mmap() for the things we
> do. Postgres' durability works by guaranteeing that our journal entries
> (called WAL := Write Ahead Log) are written & synced to disk before the
> corresponding entries of tables and indexes reach the disk. That also
> allows to group together many random-writes into a few contiguous writes
> fdatasync()ed at once. Only during a checkpointing phase the big bulk of
> the data is then (slowly, in the background) synced to disk.

Which is the exact algorithm most journalling filesystems use for
ensuring durability of their metadata updates.  Indeed, here's an
interesting piece of architecture that you might like to consider:

* Neither XFS nor BTRFS uses the kernel page cache to back its
  metadata transaction engine.

Why not? Because the page cache is too simplistic to adequately
represent the complex object hierarchies that the filesystems have,
so its flat LRU reclaim algorithms and writeback control
mechanisms are a terrible fit and cause lots of performance issues
under memory pressure.

IOWs, the two most complex high performance transaction engines in
the Linux kernel have moved to fully customised cache and (direct)
IO implementations because the requirements for scalability and
performance are far more complex than the kernel page cache
infrastructure can provide.
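
To make the ordering constraint concrete, here is a minimal user-space
sketch of a cache that enforces "log before data" itself. It is purely
illustrative and is not how XFS, BTRFS or Postgres implement it. The
names Wal, BufferPool and PAGE_SIZE, the record format and the 8k page
size are all invented for the example, and it uses plain buffered
pread/pwrite rather than direct IO to keep it short.

```python
import os

PAGE_SIZE = 8192  # assumed page size, chosen only for this sketch

# fdatasync is not exposed on every platform; fall back to fsync.
_sync = getattr(os, "fdatasync", os.fsync)


class Wal:
    """Hypothetical append-only log that tracks how far it is durable."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.end_lsn = 0      # bytes appended so far
        self.flushed_lsn = 0  # bytes known to have been synced

    def append(self, record):
        os.write(self.fd, record)
        self.end_lsn += len(record)
        return self.end_lsn

    def flush(self, upto):
        # One sync covers every record appended so far.
        if upto > self.flushed_lsn:
            _sync(self.fd)
            self.flushed_lsn = self.end_lsn


class BufferPool:
    """Hypothetical cache: the application, not the kernel, decides
    when a dirty page may be written back."""

    def __init__(self, data_path, wal):
        self.fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)
        self.wal = wal
        self.pages = {}  # page number -> bytearray holding the page
        self.dirty = {}  # page number -> LSN of the last change to it

    def read(self, pageno):
        if pageno not in self.pages:
            raw = os.pread(self.fd, PAGE_SIZE, pageno * PAGE_SIZE)
            # Pages past end of file read back as zeroes.
            self.pages[pageno] = bytearray(raw.ljust(PAGE_SIZE, b"\0"))
        return self.pages[pageno]

    def modify(self, pageno, offset, data):
        page = self.read(pageno)
        # Log the change first, then touch only the in-memory copy.
        lsn = self.wal.append(b"%d:%d:" % (pageno, offset) + data + b"\n")
        page[offset:offset + len(data)] = data
        self.dirty[pageno] = lsn

    def checkpoint(self):
        """Write back all dirty pages; returns how many were written."""
        if not self.dirty:
            return 0
        # The log must be durable up to the newest change before any
        # of the corresponding data pages reach the disk.
        self.wal.flush(max(self.dirty.values()))
        written = 0
        for pageno in sorted(self.dirty):  # ascending file offsets
            os.pwrite(self.fd, self.pages[pageno], pageno * PAGE_SIZE)
            written += 1
        _sync(self.fd)
        self.dirty.clear()
        return written
```

Nothing reaches the data file until checkpoint() runs, and checkpoint()
refuses to write a page before the log covering it has been synced.
That is the kind of dependency a flat LRU has no way to express.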

Just food for thought....


Dave Chinner
