On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> > Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> Is this because of problems in the mmap_sem?
It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the amount of IPIs increasing far too much for
it to be practical, but I am not sure anymore.
> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
> You mean madvise (page_addr)? It shouldn't ... the state of the dirty
> bit should only be updated by actual writes. Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).
Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?
> We also do have a way of transplanting pages: it's called splice. How
> do the semantics of splice differ from what you need?
Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
We don't need zero-copy IO here, the important thing is just to fill the
pagecache with content without a) rereading the page from disk b)
marking the page as dirty.
> > One major usecase is transplanting a page comming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?
I think purely using mmap() is pretty unlikely to work out - there's
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.
> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
> Well, then, it sounds like the best way forward would be to get
> postgress to use the kernel page cache more efficiently.
No arguments there, although working on postgres scalability is a good
idea as well ;)
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription: