Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> mmap() is Right Out because it does not afford us sufficient control
> >> over when changes to the in-memory data will propagate to disk.
> >
> > ... that's especially true if we simply cannot
> > have the page written to disk in a partially-modified state (something
> > I can easily see being an issue for the WAL -- would the same hold
> > true of the index/data files?).
>
> You're almost there. Remember the fundamental WAL rule: log entries
> must hit disk before the data changes they describe. That means that we
> need not only a way of forcing changes to disk (fsync) but a way of
> being sure that changes have *not* gone to disk yet. In the existing
> implementation we get that by just not issuing write() for a given page
> until we know that the relevant WAL log entries are fsync'd down to
> disk. (BTW, this is what the LSN field on every page is for: it tells
> the buffer manager the latest WAL offset that has to be flushed before
> it can safely write the page.)
>
> mmap provides msync which is comparable to fsync, but AFAICS it
> provides no way to prevent an in-memory change from reaching disk too
> soon. This would mean that WAL entries would have to be written *and
> flushed* before we could make the data change at all, which would
> convert multiple updates of a single page into a series of
> write-and-wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per
> transaction is bad enough, once per atomic action is intolerable.

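Just to check that I have the rule straight, here is a toy model of it
as I understand it. It is written in Python for brevity, every name in
it (WAL, Page, update_page, write_page) is made up for illustration,
and it is not meant to resemble the actual buffer manager code. The
point is only that a page records the latest WAL offset it depends on,
and the log is flushed up to that offset before the page is written.

```python
class WAL:
    """Toy write-ahead log that only tracks byte offsets (LSNs)."""

    def __init__(self):
        self.insert_lsn = 0    # end of the log records buffered in memory
        self.flushed_lsn = 0   # how far the log is known to be on disk
        self.fsyncs = 0        # counts simulated fsync() calls

    def append(self, record):
        """Buffer a log record in memory; return the LSN just past it."""
        self.insert_lsn += len(record)
        return self.insert_lsn

    def flush(self, upto_lsn):
        """Make sure the log is on disk at least up to upto_lsn."""
        if upto_lsn > self.flushed_lsn:
            # One simulated fsync covers everything buffered so far.
            self.flushed_lsn = self.insert_lsn
            self.fsyncs += 1


class Page:
    def __init__(self):
        self.data = bytearray()
        self.lsn = 0   # latest WAL offset that must be flushed before writing


def update_page(wal, page, change):
    """Log the change (in memory), then modify the in-memory page. No fsync."""
    page.lsn = wal.append(change)
    page.data += change


def write_page(wal, page, disk):
    """The only path to disk: flush WAL up to page.lsn, then write the page."""
    wal.flush(page.lsn)
    disk.append(bytes(page.data))
```

In this model, several update_page() calls on the same page cost no
fsync at all, and the eventual write_page() costs at most one. If the
page could reach disk on its own at any moment, as with a writable
mmap, every update_page() would have to call wal.flush() first.
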
Hmm... something just occurred to me about this.

Would a hybrid approach be possible? That is, use mmap() to handle
reads, and use write() to handle writes?

Any code that wishes to write to a page would have to recognize that
it's doing so and fetch a copy from the storage manager (or something),
which would look to see if the page already exists as a writeable
buffer. If it doesn't, the storage manager creates one by allocating
the memory, copying the page from the mmap()ed area into the new
buffer, and returning it. If it does, the storage manager just returns
a pointer to the existing buffer.

There would obviously have to be some bookkeeping involved: the storage
manager would have to know how to map an mmap()ed page to its writeable
buffer and vice versa, so that once it decides to write the buffer it
can determine which page in the original file the buffer corresponds to
(so it can do the appropriate lseek()).

In a write-heavy database, you'll end up with a lot of memory copy
operations, but with the scheme we currently use you get that anyway
(it just happens in kernel code instead of user code), so I don't see
that as much of a loss, if any. Where you win is in a read-heavy
database: you end up being able to read directly from the pages in the
kernel's page cache and thus save a memory copy from kernel space to
user space, not to mention the system call overhead of issuing the
read().

Obviously you'd want to mmap() the file read-only in order to prevent
the issues you mention regarding an errant backend, and then reopen the
file read-write for the purpose of writing to it. In fact, you could
decouple the two: mmap() the file, then close the file -- the mmap()ed
region will remain mapped. Then, as long as the file remains mapped,
you need to open the file again only when you want to write to it.

--
Kevin Brown                                       [EMAIL PROTECTED]
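
P.S. To make the bookkeeping concrete, here is a rough sketch of the
hybrid scheme described above. It is a toy in Python rather than
backend C, and HybridStore, its method names, the dict used as the
page-to-buffer table, and the 8192-byte PAGE_SIZE are all assumptions
made for the example. It also assumes readers should see a page's
writeable copy once one exists, which the description above leaves
open. It relies on a Unix-like system for os.pwrite().

```python
import mmap
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


class HybridStore:
    """Reads come from a read-only mmap; writes go through private buffers."""

    def __init__(self, path):
        self.path = path
        fd = os.open(path, os.O_RDONLY)
        try:
            # Map the whole file read-only (the file must be non-empty).
            self.map = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        finally:
            os.close(fd)       # the mapping stays valid after the close
        self.writable = {}     # page number -> bytearray copy of that page

    def read_page(self, pageno):
        """Return a view of the page without copying it."""
        buf = self.writable.get(pageno)
        if buf is not None:
            return memoryview(buf)
        start = pageno * PAGE_SIZE
        return memoryview(self.map)[start:start + PAGE_SIZE]

    def get_writable(self, pageno):
        """Return the page's writeable buffer, copying it on first use."""
        buf = self.writable.get(pageno)
        if buf is None:
            start = pageno * PAGE_SIZE
            buf = bytearray(self.map[start:start + PAGE_SIZE])
            self.writable[pageno] = buf
        return buf

    def flush_page(self, pageno):
        """Write the buffer back with an explicit write; the caller picks when."""
        buf = self.writable.get(pageno)
        if buf is None:
            return             # nothing to write for this page
        fd = os.open(self.path, os.O_WRONLY)   # reopen only for writing
        try:
            os.pwrite(fd, buf, pageno * PAGE_SIZE)
        finally:
            os.close(fd)
        del self.writable[pageno]
```

Since the page only reaches the file through flush_page(), the caller
keeps control over when that happens, e.g. only after the relevant WAL
entries have been flushed. Nothing is written through the mapping
itself.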