On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
>> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
>> <james.bottom...@hansenpartnership.com> wrote:
>> > No, I'm sorry, that's never going to be possible.  No user space
>> > application has all the facts.  If we give you an interface to force
>> > unconditional holding of dirty pages in core you'll livelock the system
>> > eventually because you made a wrong decision to hold too many dirty
>> > pages.   I don't understand why this has to be absolute: if you advise
>> > us to hold the pages dirty and we do up until it becomes a choice to
>> > hold on to the pages or to thrash the system into a livelock, why would
>> > you ever choose the latter?  And if, as I'm assuming, you never would,
>> > why don't you want the kernel to make that choice for you?
>> If you don't understand how write-ahead logging works, this
>> conversation is going nowhere.  Suffice it to say that the word
>> "ahead" is not optional.
> No, I do ... you mean the order of write out, if we have to do it, is
> important.  In the rest of the kernel, we do this with barriers which
> causes ordered grouping of I/O chunks.  If we could force a similar
> ordering in the writeout code, is that enough?

Probably not.  There is a whole raft of problems here.  For that to
be of any use, we'd have to move to mmap()ing each buffer instead
of read()ing them in, and apparently mmap() doesn't scale well to
millions of mappings.  And even if it did, then we'd have a solution
that only works on Linux.  Plus, as Tom pointed out, there are
critical sections where it's not just a question of ordering but in
fact you need to completely hold off writes.
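
To make the ordering constraint concrete, here is a minimal sketch in
Python of the write-ahead rule: a dirty page may only reach disk after
the log record that describes its last change is durable.  The names
(Wal, Page, modify, write_page, WalBeforeDataError) are hypothetical and
chosen for illustration; this is a toy model, not PostgreSQL's actual
code.

```python
class WalBeforeDataError(Exception):
    """Raised when a data page would reach disk before its WAL record."""


class Wal:
    def __init__(self):
        self.next_lsn = 1      # LSN to assign to the next record
        self.flushed_lsn = 0   # highest LSN known to be durable

    def append(self):
        """Add a record in memory only and return its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto_lsn):
        """Make every record up to upto_lsn durable."""
        self.flushed_lsn = max(self.flushed_lsn, upto_lsn)


class Page:
    def __init__(self):
        self.lsn = 0           # LSN of the last WAL record that changed this page
        self.dirty = False


def modify(page, wal):
    """Log the change first, then mark the page dirty."""
    page.lsn = wal.append()
    page.dirty = True


def write_page(page, wal):
    """Write a dirty page, refusing if its WAL is not yet durable."""
    if page.lsn > wal.flushed_lsn:
        raise WalBeforeDataError(
            f"page LSN {page.lsn} > flushed LSN {wal.flushed_lsn}"
        )
    page.dirty = False
```

In this model, a kernel that writes out a dirty mapped page on its own
schedule is the equivalent of skipping the check in write_page().  The
sketch covers only the ordering rule; it does not model the critical
sections mentioned above, where writes have to be held off entirely.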

In terms of avoiding double-buffering, here's my thought after reading
what's been written so far.  Suppose we read a page into our buffer
pool.  As long as the page is clean, it would be ideal for the mapping to
be shared between the buffer cache and our pool, sort of like
copy-on-write.  That way, if we decide to evict the page, it will
still be in the OS cache if we end up needing it again (remember, the
OS cache is typically much larger than our buffer pool).  But if the
page is dirtied, then instead of copying it, just have the buffer pool
forget about it, because at that point we know we're going to write
the page back out anyway before evicting it.

This would be pretty similar to copy-on-write, except without the
copying.  It would just be forget-from-the-buffer-pool-on-write.
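
A rough sketch of that idea in Python, as a toy model only.  The class
and method names are hypothetical, and there is no real kernel
interface behind it.  It also makes two assumptions that go beyond what
is written above: that when a page is dirtied it is the OS-cache side
that drops its reference, so that exactly one copy remains and nothing
is copied, and that a written-back page becomes shared again.

```python
class SharedPageModel:
    """Toy model: one physical frame may be referenced by the OS cache,
    by the database buffer pool, or by both at once."""

    def __init__(self):
        self.os_cache = set()   # page ids the OS cache references
        self.pool = set()       # page ids the buffer pool references
        self.dirty = set()      # page ids dirtied in the buffer pool

    def read(self, page_id):
        # A clean read is shared: both sides reference the same frame.
        self.os_cache.add(page_id)
        self.pool.add(page_id)

    def dirty_page(self, page_id):
        # Assumption: sharing ends without a copy; the pool keeps the frame.
        if page_id not in self.pool:
            raise KeyError(page_id)
        self.dirty.add(page_id)
        self.os_cache.discard(page_id)

    def write_back(self, page_id):
        # Assumption: once written out, the page is clean and shared again.
        self.dirty.discard(page_id)
        self.os_cache.add(page_id)

    def evict(self, page_id):
        # A dirty page has to be written back before the pool lets go of it.
        if page_id in self.dirty:
            raise RuntimeError("dirty page must be written back before eviction")
        self.pool.discard(page_id)

    def frames_in_use(self):
        # A shared page counts once, so there is no double-buffering.
        return len(self.os_cache | self.pool)
```

In this model a clean page costs one frame however many sides reference
it, and evicting it from the pool leaves it available in the OS cache.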

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)