On Mon, Jan 13, 2014 at 02:19:56PM -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> > On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > > Well, if we were to collaborate with the kernel community on this then
> > > > presumably we can do better than that for eviction... even to the
> > > > extent of "here's some data from this range in this file. It's (clean|
> > > > dirty). Put it in your cache. Just trust me on this."
> > >
> > > This should be the madvise() interface (with MADV_WILLNEED and
> > > MADV_DONTNEED) is there something in that interface that is
> > > insufficient?
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces).
> I understand, that's why you get double buffering: because we can't
> replace a page in the range you give us on read/write. However, you
> don't have to switch entirely to mmap: you can use mmap/madvise
> exclusively for cache control and still use read/write (and still pay
> the double buffer penalty, of course). It's only read/write with
> directio that would cause problems here (unless you're planning to
> switch to DIO?).
There are hazards with using mmap/madvise that may or may not be a problem
for them. I think these are well known but just in case:

mmap/munmap intensive workloads may get hammered on taking mmap_sem for
write. The greatest costs are incurred if the application is threaded
and the parallel threads are fault-intensive. I do not think this is the
case for PostgreSQL as it is process based but it is a concern. Even if
it's a single-threaded process, the cost of the mmap_sem cache line
bouncing can be a concern. Outside of that, the mmap/munmap paths are just
really costly and take a lot of work.

madvise has different hazards but let's take DONTNEED as an example because
it's the most likely candidate for use. A DONTNEED hint has three potential
downsides. The first is that mmap_sem taken for read can be very costly
for threaded applications as the cache line bounces. On NUMA machines it
can be a major problem for madvise-intensive workloads. The second is that
the page table teardown frees the pages with the associated costs but, most
importantly, an IPI is required afterwards to flush the TLB. If that process
has been running on a lot of different CPUs then the IPI cost can be very
high. The third hazard is that a madvise(DONTNEED) region will incur page
faults on the next accesses, again hammering mmap_sem and paying all the
costs associated with faulting (allocating the same pages again, zeroing,
etc).
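
To make the third hazard concrete, here is a minimal sketch that touches an
anonymous private mapping, drops it with madvise(MADV_DONTNEED), and then
touches it again while counting minor faults. The names measure_refault()
and struct refault_result are made up for illustration. The zero-fill check
relies on the Linux behaviour for anonymous private mappings, and the fault
count will vary between systems, so treat it as an indication only.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical result type, for illustration only. */
struct refault_result {
    long minor_faults; /* minor faults taken while re-touching the region */
    int zeroed;        /* 1 if every page read back as zero after DONTNEED */
};

/*
 * Hypothetical helper: fault in npages of anonymous memory, discard them
 * with MADV_DONTNEED, then touch them again and record what it cost.
 * Returns 0 on success, -1 on error.
 */
int measure_refault(size_t npages, struct refault_result *out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0 || out == NULL)
        return -1;

    size_t len = npages * (size_t)page;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* First touch: every page is allocated and filled with a marker. */
    for (size_t i = 0; i < len; i += (size_t)page)
        buf[i] = 0xAB;

    /* Tear down the page tables and free the pages. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    /* Second touch: each access faults again and gets a fresh zeroed page. */
    out->zeroed = 1;
    for (size_t i = 0; i < len; i += (size_t)page) {
        if (buf[i] != 0)
            out->zeroed = 0;
        buf[i] = 0xAB;
    }

    getrusage(RUSAGE_SELF, &after);
    out->minor_faults = after.ru_minflt - before.ru_minflt;

    munmap(buf, len);
    return 0;
}
```

Calling measure_refault(64, &r) and printing r.minor_faults should show the
faults being paid a second time for the same range. It says nothing about
mmap_sem contention or IPI cost, which would need a threaded test on a
multi-CPU machine.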

It may be the case that mmap/madvise is still required to handle the double
buffering problem but it's far from being a free lunch and it has costs
that read/write does not have to deal with. Maybe some of these problems
can be fixed or mitigated but this is a case where we need a test case
that demonstrates the problem, even if that requires patching PostgreSQL.
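
For reference, the hybrid approach suggested upthread (keep read/write for
the I/O and use a mapping purely to issue hints) might look roughly like
the sketch below. hinted_pread() is a hypothetical helper, not an existing
PostgreSQL or kernel interface, and it assumes a regular file and a
non-negative offset. It only shows the MADV_WILLNEED side.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that [off, off + len) of fd will be needed,
 * then read it with pread(). The mapping is never dereferenced; it exists
 * only so that madvise() has an address range to operate on.
 * Returns the pread() result, or -1 on error.
 */
ssize_t hinted_pread(int fd, void *dst, size_t len, off_t off)
{
    struct stat st;
    if (off < 0 || fstat(fd, &st) != 0 || st.st_size <= 0)
        return -1;

    size_t file_len = (size_t)st.st_size;
    void *map = mmap(NULL, file_len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* madvise() requires a page-aligned start address. */
    long page = sysconf(_SC_PAGESIZE);
    size_t start = (size_t)off - ((size_t)off % (size_t)page);
    if (start < file_len) {
        size_t span = ((size_t)off - start) + len;
        if (span > file_len - start)
            span = file_len - start;
        /* Purely advisory, so a failure here is ignored. */
        (void)madvise((char *)map + start, span, MADV_WILLNEED);
    }

    /* The data still travels through the normal read path. */
    ssize_t n = pread(fd, dst, len, off);

    munmap(map, file_len);
    return n;
}
```

Note that mapping and unmapping on every call is exactly the pattern that
takes mmap_sem for write, so anything serious would presumably keep the
mapping around and pay only the madvise cost per hint.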