On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> >> Doesn't sound exactly like what I had in mind. What I was suggesting
> >> is an analogue of read() that, if it reads full pages of data to a
> >> page-aligned address, shares the data with the buffer cache until it's
> >> first written instead of actually copying the data.
> > The only way to make this happen is to mmap the file to the buffer and use
> > MADV_WILLNEED.
> >> The pages are
> >> write-protected so that an attempt to write the address range causes a
> >> page fault. In response to such a fault, the pages become anonymous
> >> memory and the buffer cache no longer holds a reference to the page.
> > OK, so here I thought of another madvise() call to switch the region to
> > anonymous memory. A page fault works too, of course, it's just that one
> > per page in the mapping will be expensive.
> I don't think either of these ideas works for us. We start by
> creating a chunk of shared memory that all processes (we do not use
> threads) will have mapped at a common address, and we read() and
> write() into that chunk.
Yes, that's what I was thinking: it's a cache. About how many files
comprise this cache? Are you thinking it's too difficult for every
process to map the files?
> > Do you care about handling aliases ... what happens if someone else
> > reads from the file, or will that never occur? The reason for asking is
> > that it's much easier if someone else mmapping the file gets your
> > anonymous memory than if we create an alias in the page cache.
> All reads and writes go through the buffer pool stored in shared
> memory, but any of the processes that have that shared memory region
> mapped could be responsible for any individual I/O request.
That seems to be possible with the abstraction. The initial mapping
gets the file-backed pages: you can use madvise to read them (using
readahead), flush them (using MADV_DONTNEED) and flip them to anonymous
(using something TBD). Since it's a shared mapping API based on the
file, any of the mapping processes can do any operation. Future mappers
of the file get the mix of real and anon memory, so it's truly shared.
Given that you want to use this as a shared cache, it seems that the API
to flip back from anon to file mapped is MADV_DONTNEED. That would also
trigger writeback of any dirty pages in the previously anon region ...
which you could force with msync. As far as I can see, this is
identical to read/write on a shared region with the exception that you
don't need to copy in and out of the page cache.
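
For illustration, here is a minimal C sketch of the parts of that sequence
that can be written with calls that exist today, assuming Linux and a
MAP_SHARED mapping of the whole file. The function names cache_map_file and
cache_flush_and_drop are made up for this example. The "flip to anonymous"
step has no existing call (it is the TBD part), so it appears only as a
comment, and the writeback is forced explicitly with msync rather than
relying on the proposed madvise behaviour.

```c
#define _DEFAULT_SOURCE          /* for madvise() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file MAP_SHARED and hint the kernel to
 * start reading it in. Returns the mapping (length in *len_out) or NULL.
 */
void *cache_map_file(const char *path, size_t *len_out)
{
    assert(len_out != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping keeps the file referenced */
    if (base == MAP_FAILED)
        return NULL;

    /* "madvise to read them (using readahead)": advisory, so the result
     * is deliberately ignored. */
    (void)madvise(base, len, MADV_WILLNEED);

    /* The "flip to anonymous" operation would go here; no such madvise
     * flag exists at the time of this thread, so nothing is called. */

    *len_out = len;
    return base;
}

/*
 * Hypothetical helper: force writeback of dirty pages with msync, then
 * tell the kernel this process no longer needs the range mapped in.
 * Returns 0 on success, -1 on failure.
 */
int cache_flush_and_drop(void *base, size_t len)
{
    if (msync(base, len, MS_SYNC) < 0)
        return -1;
    return madvise(base, len, MADV_DONTNEED);
}
```

Any process that maps the same file this way sees the same pages, and a
store through the mapping followed by cache_flush_and_drop() ends up in the
file without a separate write() copy.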
From our point of view, the implementation is nice because the pages
effectively never leave the page cache. We just use an extra per page
flag (which I'll get shot for suggesting) to alter the writeout path
(which is where the complexity which may kill the implementation is).
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)