On 1/14/14, 10:08 AM, Tom Lane wrote:
Trond Myklebust <tron...@gmail.com> writes:
On Jan 14, 2014, at 10:39, Tom Lane <t...@sss.pgh.pa.us> wrote:
"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.

Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.
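
A minimal sketch of that ordering, written in Python rather than Postgres's C, may make it concrete. The file names, the record format, and the apply_change() helper are all hypothetical; only the sequence of system calls is the point: the WAL record is written and fsync'd before the dirty page is handed to the kernel.

```python
import os
import tempfile

PAGE_SIZE = 8192  # illustrative page size (assumption for this sketch)


def apply_change(wal_fd, data_fd, page_no, page, wal_record):
    """Hypothetical helper: make the WAL record durable before the page is written."""
    # 1. `page` has already been dirtied in our own buffer; the kernel
    #    has not seen the new contents yet.
    # 2. Append the WAL record and force it to stable storage.
    os.write(wal_fd, wal_record)
    os.fsync(wal_fd)
    # 3. Only now hand the dirty page to the kernel page cache.
    os.pwrite(data_fd, bytes(page), page_no * PAGE_SIZE)


def demo():
    with tempfile.TemporaryDirectory() as d:
        wal_fd = os.open(os.path.join(d, "wal"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            page = bytearray(PAGE_SIZE)
            page[0:5] = b"tuple"  # dirty the page in the private buffer
            apply_change(wal_fd, data_fd, 0, page, b"insert tuple into page 0\n")
            return os.pread(data_fd, 5, 0)
        finally:
            os.close(wal_fd)
            os.close(data_fd)
```

Because the page lives in a private buffer until step 3, the kernel has nothing it could write early; that is the property an mmap-based design would have to reproduce some other way.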

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
"you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there".  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the "current"
filesystem contents, it just might not all have reached disk yet; which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.
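
The "very simple comparison" can be sketched roughly as follows. This is a toy model in Python, not Postgres code: the Page and Wal classes and the can_write() and flush_page() helpers are hypothetical names, and the WAL flush is a stand-in for the real write and fsync.

```python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytes
    lsn: int           # WAL position of the last record that modified this page
    dirty: bool = True


class Wal:
    def __init__(self):
        self.inserted_lsn = 0  # end of WAL generated so far
        self.flushed_lsn = 0   # end of WAL known to be on disk

    def append(self, nbytes):
        self.inserted_lsn += nbytes
        return self.inserted_lsn

    def flush(self, upto):
        # Stand-in for write() + fsync() of the WAL up to `upto`.
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.inserted_lsn))


def can_write(page, wal):
    """The entire write-ordering test: one integer comparison."""
    return page.lsn <= wal.flushed_lsn


def flush_page(page, wal, write_page):
    if not can_write(page, wal):
        wal.flush(page.lsn)  # WAL must reach disk first
    write_page(page)         # now the page may be handed to the kernel
    page.dirty = False
```

A general kernel barrier API would have to track arbitrary "this page after that data" dependencies, whereas here every dependency collapses to a single monotonically increasing number.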

For the sake of completeness... it's theoretically silly that Postgres is doing 
all this stuff with WAL when the filesystem is doing something very similar 
with its journal. And an SSD (and next-generation spinning rust) is 
doing the same thing *again* in its own journal.

If all 3 communities (or even just 2 of them!) could agree on the necessary 
interface, a tremendous amount of this duplicated technology could be eliminated.

That said, I rather doubt the Postgres community would go this route, not so much because 
of the presumably massive changes needed, but more because our community is not a fan of 
restricting our users to things like "Thou shalt use a journaled FS or risk all thy 
data!"

Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (e.g., we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.
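
To illustrate the intermediate-state problem, here is a small Python sketch under assumed conditions: a toy page holding two 4-byte "records" that get swapped in two steps. The page layout and the demo() function are hypothetical. With a private buffer, the half-finished page never reaches the file because no write is issued until the change is complete.

```python
import os
import tempfile

PAGE_SIZE = 8192  # illustrative page size (assumption for this sketch)


def demo():
    with tempfile.TemporaryDirectory() as d:
        fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.pwrite(fd, b"AAAABBBB".ljust(PAGE_SIZE, b"\0"), 0)
            page = bytearray(os.pread(fd, PAGE_SIZE, 0))  # private copy of the page

            # Step 1 of swapping the two records: slot 0 is overwritten, so
            # record A temporarily exists nowhere in the page.
            saved = bytes(page[0:4])
            page[0:4] = page[4:8]
            during = os.pread(fd, 8, 0)  # the file still holds the old, consistent page

            # Step 2: finish the swap, then write the page in one call.
            page[4:8] = saved
            os.pwrite(fd, bytes(page), 0)
            after = os.pread(fd, 8, 0)
            return during, after
        finally:
            os.close(fd)
```

If `page` were instead a shared mmap of the file, the store in step 1 would land directly in the kernel's copy of the page, which the kernel may write back whenever it chooses, including between the two steps.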

Yet another problem with trying to combine database and journaled FS efforts... 
:(
--
Jim C. Nasby, Data Architect                       j...@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)