Trond Myklebust writes:
> On Jan 14, 2014, at 10:39, Tom Lane wrote:
>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database.  It has to
>> be treated like a write barrier.

> Then why are you dirtying the page at all? It makes no sense to tell the 
> kernel “we’re changing this page in the page cache, but we don’t want you to 
> change it on disk”: that’s not consistent with the function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
"you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there".  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the "current"
filesystem contents; it just might not all have reached disk yet, which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.
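
To make that comparison concrete, here is a minimal sketch of the rule as a
toy model in Python. It is not the actual Postgres code: the names Page, Wal,
can_write_page and write_page are hypothetical, and the flush is only a
stand-in for writing and fsync'ing the WAL file.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_lsn: int       # WAL position of the last record that changed this page
    dirty: bool = True


class Wal:
    """Toy WAL that only tracks how far records were inserted and flushed."""

    def __init__(self):
        self.insert_lsn = 0
        self.flushed_lsn = 0

    def append(self, nbytes):
        # Pretend to add a record; return the WAL position just past it.
        self.insert_lsn += nbytes
        return self.insert_lsn

    def flush(self, upto):
        # Stand-in for write() + fsync() of the WAL up to position `upto`.
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))


def can_write_page(page, wal):
    # The whole ordering test: WAL must be on disk at least as far as the
    # page's own sequence number.
    return wal.flushed_lsn >= page.page_lsn


def write_page(page, wal):
    # Force the WAL out first if needed, and only then release the page.
    if not can_write_page(page, wal):
        wal.flush(page.page_lsn)
    # (The page would be handed to the kernel with write() at this point.)
    page.dirty = False
```

The point of the sketch is that the test is a single integer comparison per
page, which is what makes it cheap for us and hard to match with a general
kernel-side barrier API.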

Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (e.g., we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.
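
As a rough illustration of why private buffers make this easy, here is a
minimal sketch, again a toy model with hypothetical names (Buffer, modify,
snapshot_for_write) rather than anything from the Postgres source. The writer
takes the same lock as the modifier, so it can only ever copy out a page that
is between changes, never in the middle of one.

```python
import threading


class Buffer:
    """Toy private buffer: page bytes plus a lock guarding them."""

    def __init__(self, size=8192):
        self.data = bytearray(size)
        self.lock = threading.Lock()

    def modify(self, mutate):
        # Intermediate states produced inside `mutate` are never visible to
        # the writer, because the writer needs the same lock.
        with self.lock:
            mutate(self.data)

    def snapshot_for_write(self):
        # Copy the page out under the lock; the copy is what would be passed
        # to write(). No change can be in progress while we hold the lock.
        with self.lock:
            return bytes(self.data)


def shuffle_records(data):
    # A multi-step change: the page is inconsistent between the two steps.
    data[0:4] = b"abcd"
    data[4:8] = data[0:4]


buf = Buffer(16)
buf.modify(shuffle_records)
image = buf.snapshot_for_write()
```

With an mmap'd page there is no equivalent of snapshot_for_write: the kernel
may pick up the page contents at any moment, including between the two steps
of shuffle_records.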

                        regards, tom lane

Sent via pgsql-hackers mailing list