On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > Well, if we were to collaborate with the kernel community on this then
> > > presumably we can do better than that for eviction... even to the
> > > extent of "here's some data from this range in this file. It's (clean|
> > > dirty). Put it in your cache. Just trust me on this."
> > This should be the madvise() interface (with MADV_WILLNEED and
> > MADV_DONTNEED). Is there something in that interface that is
> > insufficient?
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces).
I understand, that's why you get double buffering: because we can't
replace a page in the range you give us on read/write. However, you
don't have to switch entirely to mmap: you can use mmap/madvise
exclusively for cache control and still use read/write (and still pay
the double buffer penalty, of course). It's only read/write with
direct I/O that would cause problems here (unless you're planning to
switch to DIO?).
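
To make that concrete, here is a rough sketch of what I have in mind,
not a tested patch. prefetch_range() is a made-up helper name, and
mapping the range on every call is only done to keep the example short.
The mapping is used purely as a handle for the hint; the data itself
still goes through read/write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a postgres or kernel API): ask the kernel to
 * start reading [off, off + len) of fd into the page cache.
 * Returns 0 on success, -1 on failure. */
static int prefetch_range(int fd, off_t off, size_t len)
{
    assert(off >= 0 && len > 0);

    /* mmap() offsets must be page aligned, so round the start down. */
    off_t pagesz = (off_t)sysconf(_SC_PAGESIZE);
    off_t start = off - (off % pagesz);
    size_t maplen = len + (size_t)(off - start);

    void *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, start);
    if (p == MAP_FAILED)
        return -1;

    /* Advisory only: the kernel is free to ignore the hint. */
    int rc = madvise(p, maplen, MADV_WILLNEED);

    /* The mapping was only a handle for the hint; drop it again. */
    munmap(p, maplen);
    return rc;
}
```

The caller would then pread() the block as it does today. Mapping a
larger range once and keeping it around would avoid paying for
mmap()/munmap() on every hint.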
> Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.
Is this because of problems in the mmap_sem?
> Also, that will mark that page dirty, which isn't what we want in this
> case.
You mean madvise (page_addr)? It shouldn't ... the state of the dirty
bit should only be updated by actual writes. Which MADV_ primitive is
causing the dirty marking, because we might be able to fix it (unless
there's some weird corner case I don't know about).
> One major use case is transplanting a page coming from postgres'
> buffers into the kernel's buffercache because the latter has a much
> better chance of properly allocating system resources across independent
> applications running.
If you want to share pages between the application and the page cache,
the only known interface is mmap ... perhaps we can discuss how better
to improve mmap for you?
We also do have a way of transplanting pages: it's called splice. How
do the semantics of splice differ from what you need?
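
For reference, a minimal sketch of the splice route, assuming Linux
with glibc. push_page() is a made-up name, and the buffer is assumed
to fit in the pipe (an 8kB page normally does). Whether the kernel
really moves the pages instead of copying them is up to the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: hand a user-space buffer to the kernel through a
 * pipe and splice it into fd at offset off.
 * Returns the number of bytes spliced, or -1 on failure. */
static ssize_t push_page(int fd, off_t off, void *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t total = -1;

    /* Map the user pages into the pipe. SPLICE_F_NONBLOCK makes this
     * return short instead of blocking if the pipe is too small. */
    ssize_t n = vmsplice(p[1], &iov, 1, SPLICE_F_NONBLOCK);
    if (n == (ssize_t)len) {
        loff_t pos = off;
        total = 0;
        while ((size_t)total < len) {
            /* SPLICE_F_MOVE is only a hint; the kernel may still copy. */
            ssize_t m = splice(p[0], NULL, fd, &pos,
                               len - (size_t)total, SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                break;
            }
            total += m;
        }
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

After this the data sits in the page cache for that file range as
ordinary dirty pages, which may or may not be the semantics you are
after.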
> Oh, and the kernel's page-cache management, while far from perfect,
> actually scales much better than postgres'.
Well, then, it sounds like the best way forward would be to get
postgres to use the kernel page cache more efficiently.