On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
> On 01/14/2014 09:39 AM, Claudio Freire wrote:
> > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <ha...@2ndquadrant.com>
> > wrote:
> >> Again, as said above the linux file system is doing fine. What we
> >> want is a few ways to interact with it to let it do even better when
> >> working with postgresql by telling it some stuff it otherwise would
> >> have to second guess and by sometimes giving it back some cache
> >> pages which were copied away for potential modifying but ended
> >> up clean in the end.
> > You don't need new interfaces. Only a slight modification of what
> > fadvise DONTNEED does.
> > This insistence on injecting pages from postgres into the kernel is
> > just a bad idea.
> Do you think it would be possible to map copy-on-write pages
> from the linux cache into the postgresql cache?
> This would be a step in the direction of solving the double-ram-usage
> of pages which have not been read from syscache to postgresql
> cache without sacrificing linux read-ahead (which I assume does
> not happen when reads bypass the system cache).
The current mechanism for coherency between a userspace cache and the
in-kernel page cache is mmap ... that's the only way you get the same
page in both currently.
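
As a rough illustration of that coherency, here is a minimal sketch,
assuming Linux and a writable /tmp. The function name and the scratch
file are made up for the example. It stores through a MAP_SHARED mapping
and then reads the same offset back with pread(), without any flush in
between. POSIX does not strictly promise the result without msync(); on
Linux the mapping and the read path share the same page cache page.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a store through a MAP_SHARED mapping
 * is visible to pread() on the same file without an explicit flush,
 * 0 if it is not, and -1 on error. */
int mmap_is_coherent_with_read(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";  /* scratch file, assumed writable */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    const char msg[] = "hello";
    memcpy(map, msg, sizeof msg);           /* dirty the page via the mapping */

    char buf[sizeof msg] = {0};
    ssize_t n = pread(fd, buf, sizeof buf, 0);  /* read via the syscall path */
    int same = (n == (ssize_t)sizeof buf) &&
               memcmp(buf, msg, sizeof msg) == 0;

    munmap(map, (size_t)page);
    close(fd);
    return same;
}
```

Note that nothing here controls when the dirtied page goes to disk; the
kernel is free to write it back at any time.
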
glibc used to have an implementation of read/write in terms of mmap, so
it should be possible to insert it into your current implementation
without a major rewrite. The problem I think this brings you is
uncontrolled writeback: you don't want dirty pages to go to disk until
you issue a write(). I think we could fix this with another madvise()
flag: something like MADV_WILLUPDATE, telling the page cache we expect
to alter the pages again, so don't be aggressive about cleaning them.
Then there are all the other issues with mmap() ... but if you can
detail those, we might be able to fix them.
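
To make the "write in terms of mmap" idea concrete, here is a minimal
sketch. It is not the old glibc code; mmap_pwrite() and its flush
parameter are names invented for the example. It assumes the file
already extends past off + len and that len is non-zero. MADV_WILLUPDATE
does not exist in any kernel, so it appears only as a comment marking
where such a hint would go; msync(MS_SYNC) stands in for the point where
the caller decides writeback is safe.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical pwrite() lookalike built on mmap.
 * Returns len on success, -1 on error. */
ssize_t mmap_pwrite(int fd, const void *buf, size_t len, off_t off, int flush)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (len == 0 || off < 0)
        return -1;

    /* mmap offsets must be page aligned, so map from the page start. */
    off_t start = off - (off % page);
    size_t span = (size_t)(off - start) + len;

    char *map = mmap(NULL, span, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, start);
    if (map == MAP_FAILED)
        return -1;

    /* A proposed hint such as madvise(map, span, MADV_WILLUPDATE) would
     * go here. That flag is only an idea from this thread, not a real
     * constant, so no call is made. */
    memcpy(map + (off - start), buf, len);

    int rc = 0;
    if (flush)
        rc = msync(map, span, MS_SYNC);  /* explicit, synchronous writeback */

    munmap(map, span);
    return rc == 0 ? (ssize_t)len : -1;
}
```

With flush set to 0 the data sits dirty in the page cache, and today
the kernel may still clean it whenever it likes, which is exactly the
uncontrolled writeback problem.
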
> and we can write back the copy at the point when it is safe (from
> postgresql perspective) to let the system write them back ?
Using MADV_WILLUPDATE, possibly ... you're still not going to have
absolute control. The kernel will write back the pages if the dirty
limits are exceeded, for instance, but we could tune it to be useful.
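
For reference, the dirty limits in question are visible under
/proc/sys/vm on Linux, e.g. dirty_background_ratio and dirty_ratio. A
small sketch for reading one of them follows; read_vm_knob() is a
made-up helper name, and it assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read an integer tunable from /proc/sys/vm.
 * Returns the value, or -1 if the file is missing or unparsable. */
long read_vm_knob(const char *name)
{
    assert(name != NULL);

    char path[256];
    int n = snprintf(path, sizeof path, "/proc/sys/vm/%s", name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_vm_knob("dirty_ratio") gives the percentage threshold;
it reads as 0 when the byte-based vm.dirty_bytes setting is in use
instead.
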
> Do you think it is possible to make it work with good performance
> for a few million 8kb pages ?
> > At the very least, it still needs postgres to know too much
> > of the filesystem (block layout) to properly work. Ie: pg must be
> > required to put entire filesystem-level blocks into the page cache,
> > since that's how the page cache works.
> I was thinking more of a simple write() interface with extra
> flags/sysctls to tell the kernel that "we already have this on disk".
> > At the very worst, it may
> > introduce serious security and reliability implications, when
> > applications can destroy the consistency of the page cache (even if
> > full access rights are checked, there's still the possibility this
> > inconsistency might be exploitable).
> If you allow a write() which just writes clean pages, I cannot see
> where the extra security concerns are beyond what a normal
> write can do.
The problem is we can't give you absolute control of when pages are
written back because that interface can be used to DoS the system: once
we get too many dirty uncleanable pages, we'll thrash looking for memory
and the system will livelock.