On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> >  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> Is this because of problems in the mmap_sem?

It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the amount of IPIs increasing far too much for
it to be practical, but I am not sure anymore.

> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
> You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
> bit should only be updated by actual writes.  Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).

Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?

> We also do have a way of transplanting pages: it's called splice.  How
> do the semantics of splice differ from what you need?

Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
We don't need zero-copy IO here, the important thing is just to fill the
pagecache with content without a) rereading the page from disk b)
marking the page as dirty.

> >  One major usecase is transplanting a page comming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there's
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.

> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
> Well, then, it sounds like the best way forward would be to get
> postgress to use the kernel page cache more efficiently.

No arguments there, although working on postgres scalability is a good
idea as well ;)


Andres Freund

 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to