However the really major difficulty with using mmap is that it breaks the scheme we are currently using for WAL, because you don't have any way to restrict how soon a change in an mmap'd page will go to disk. (No, I don't believe that mlock guarantees this. It says that the page will not be removed from main memory; it does not specify that, say, the syncer won't write the contents out anyway.)
I had to think about this for a minute (now nearly a week) and reread the docs on WAL before I grokked what could happen here. You're absolutely right that WAL needs to be taken into account first. How does this execution path sound to you?
By default, all mmap(2)'ed pages are MAP_SHARED. There are no complications with regards to reads.
When a backend wishes to write a page, the following steps are taken:
1) Backend grabs a lock from the lockmgr to write to the page (exactly as it does now)
2) Backend mmap(2)'s a second copy of the page(s) being written to, this time with the MAP_PRIVATE flag set. Mapping a copy of the page again is wasteful in terms of address space, but does not require any more memory than our current scheme. The re-mapping of the page with MAP_PRIVATE prevents changes to the data that other backends are viewing.
3) The writing backend can then scribble on its private copy of the page(s) as it sees fit.
4) Once it has finished making changes and the transaction is to be committed, the backend WAL logs its changes.
5) Once the WAL logging is complete and it has hit the disk, the backend msync(2)'s its private copy of the pages to disk (ASYNC or SYNC, it doesn't really matter too much to me).
6) Optional(?). I'm not sure whether the backend would also need to issue an msync(2) MS_INVALIDATE, but I suspect it would not need to on systems with unified buffer caches such as FreeBSD or OS-X. On HPUX, or other older *NIX'es, it may be necessary. *shrug* I could be trying to be overly protective here.
7) Backend munmap(2)'s its private copy of the written on page(s).
8) Backend releases its lock from the lockmgr.
At this point, the remaining backends now are able to see the updated pages of data.
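To make the steps above concrete, here is a minimal sketch in C. It is not PostgreSQL code: update_page_via_private_copy() and its arguments are hypothetical names, the lockmgr calls (steps 1 and 8) are left to the caller, and the WAL write in step 4 is only a comment. It also rests on one assumption that differs from step 5 as written: POSIX says changes made through a MAP_PRIVATE mapping are not carried through to the underlying file, so the sketch publishes the change by copying the private page into the MAP_SHARED mapping and calling msync(2) on that mapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: update `len` bytes of `fd` at
 * page-aligned offset `off` by way of a private copy. `len` must not run
 * past the end of the file. Sets *isolated to 1 if the shared view did not
 * see the private change (this assumes the old contents differ from `data`).
 * Returns 0 on success, -1 on failure.
 */
int
update_page_via_private_copy(int fd, off_t off, size_t len,
                             const char *data, size_t data_len, int *isolated)
{
    char *shared, *priv;
    int   rc = -1;

    assert(data != NULL && isolated != NULL && data_len <= len);

    /* The view that every other backend reads from. */
    shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (shared == MAP_FAILED)
        return -1;

    /* Step 2: a second, copy-on-write mapping of the same page. */
    priv = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, off);
    if (priv == MAP_FAILED)
    {
        munmap(shared, len);
        return -1;
    }

    /* Step 3: scribble on the private copy only. */
    memcpy(priv, data, data_len);
    *isolated = (memcmp(shared, data, data_len) != 0);

    /* Step 4 would go here: write the WAL record and wait until it is on disk. */

    /*
     * Step 5, under the assumption named above: copy the private page into
     * the shared mapping, then msync() the shared mapping.
     */
    memcpy(shared, priv, len);
    if (msync(shared, len, MS_SYNC) == 0)
        rc = 0;

    /* Step 7: drop the mappings. */
    munmap(priv, len);
    munmap(shared, len);
    return rc;
}
```

A caller would take the page lock, call update_page_via_private_copy(fd, 0, pagesize, "hello", 5, &isolated), and release the lock afterwards. The copy into the shared mapping is not atomic with respect to readers, which is why the lock has to be held across it.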
Let's look at what happens with a read(2) call. To read(2) data you have to have a block of memory to copy data into. Assume your OS of choice has a good malloc(3) implementation and it only needs to call brk(2) once to extend the process's address space after the first malloc(3) call. There's your first system call, which guarantees one context switch.
Wrong. Our reads occur into shared memory allocated at postmaster startup, remember?
Doh. Fair enough. In most programs that involve read(2), a call to malloc(3) needs to be made.
mmap(2) is a totally different animal in that you don't ever need to
make calls to read(2): mmap(2) is used in place of those calls (With
#ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it
was working with a page of mmap(2)'ed data or need to know that it is).
Instead, you have to worry about address space management and keeping a consistent view of the data.
Which is largely handled by mmap() and the VM.
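As a rough illustration of the "#ifdef and a good abstraction" idea, here is a small sketch in C. PageRef, page_get(), page_release() and the USE_MMAP switch are hypothetical names made up for this example; nothing here is taken from the PostgreSQL source. The caller gets a pointer and a length either way and cannot tell whether the bytes came from mmap(2) or from a read into a malloc(3)'ed buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define USE_MMAP 1   /* hypothetical build switch; remove to use the read path */

/* Hypothetical page handle: callers see only a pointer and a length. */
typedef struct PageRef
{
    char   *data;
    size_t  len;
} PageRef;

/*
 * Make `len` bytes of `fd` at page-aligned offset `off` readable.
 * Returns 0 on success, -1 on failure.
 */
int
page_get(int fd, off_t off, size_t len, PageRef *ref)
{
    assert(ref != NULL && len > 0);
#ifdef USE_MMAP
    /* No user-space buffer: the kernel's cached page is mapped directly. */
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

    if (p == MAP_FAILED)
        return -1;
    ref->data = p;
#else
    /* Fallback: allocate a buffer and copy the data into it. */
    char *buf = malloc(len);

    if (buf == NULL)
        return -1;
    if (pread(fd, buf, len, off) != (ssize_t) len)
    {
        free(buf);
        return -1;
    }
    ref->data = buf;
#endif
    ref->len = len;
    return 0;
}

/* Give the page back; the handle is cleared so stale use is easier to spot. */
void
page_release(PageRef *ref)
{
    assert(ref != NULL);
#ifdef USE_MMAP
    munmap(ref->data, ref->len);
#else
    free(ref->data);
#endif
    ref->data = NULL;
    ref->len = 0;
}
```

This only covers the read side. The address space management and consistency concerns raised above are exactly the parts such a wrapper does not hide.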
... If a write(2) system call is issued on a page of mmap(2)'ed data (and your operating system supports it, I know FreeBSD does, but don't think Linux does), then the page of data is DMA'ed by the network controller and sent out without the data needing to be copied into the network controller's buffer.
Perfectly irrelevant to Postgres, since there is no situation where we'd
ever write directly from a disk buffer to a socket; in the present
implementation there are at least two levels of copy needed in between
(datatype-specific output function and protocol message assembly). And
that's not even counting the fact that any data item large enough to
make the savings interesting would have been sliced, diced, and
compressed by TOAST.
The biggest winners will be columns whose storage type is PLAIN or EXTERNAL. writev(2) from mmap(2)'ed pages and non-mmap(2)'ed pages would be a nice perk too (not sure if PostgreSQL uses this or not). Since compression isn't happening on most tuples under 1K in size and most tuples in a database are going to be under that, most tuples are going to be uncompressed. The total page count for the database, however, is likely a different story. For large tuples that are uncompressed and larger than a page, it is probably beneficial to use sendfile(2) instead of mmap(2) + write(2)'ing the page/file.
If a large tuple is compressed, it'd be interesting to see whether it's worthwhile to uncompress the data onto anonymously mmap(2)'ed page(s), so that the benefits of zero-socket-copies could still be used.
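Here is a small sketch in C of the writev(2) idea: one iovec entry points at ordinary process memory (a header) and the other at a file-backed mmap(2)'ed region. send_header_and_file() is a hypothetical name for this example. Whether the kernel actually avoids a copy for the mapped part depends on the platform and on the kind of descriptor being written to, so this only shows the calling pattern, not a guaranteed zero-copy path.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the NUL-terminated `hdr` followed by the first
 * `len` bytes of the file at `path` to `out_fd` with a single writev(2).
 * Returns the number of bytes written, or -1 on failure.
 */
ssize_t
send_header_and_file(int out_fd, const char *hdr, const char *path, size_t len)
{
    struct iovec iov[2];
    ssize_t      n;
    void        *page;
    int          fd;

    assert(hdr != NULL && path != NULL && len > 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    page = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (page == MAP_FAILED)
        return -1;

    iov[0].iov_base = (void *) hdr;  /* ordinary (non-mmap'ed) memory */
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = page;          /* mmap'ed file data */
    iov[1].iov_len = len;

    n = writev(out_fd, iov, 2);      /* may be short; real callers should loop */

    munmap(page, len);
    return n;
}
```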
shared mem is a bastardized subsystem that works, but isn't integral to
any performance areas in the kernel so it gets neglected.
What performance issues do you think shared memory needs to have fixed?
We don't issue any shmem kernel calls after the initial shmget, so
comparing the level of kernel tenseness about shmget to the level of
tenseness about mmap is simply irrelevant. Perhaps the reason you don't
see any traffic about this on the kernel lists is that shared memory
already works fine and doesn't need any fixing.
I'm gunna get flamed for this, but I think it's improperly used as a second level cache on top of the operating system's cache. mmap(2) would consolidate all caching into the kernel.
Please ask questions if you have them.
Do you have any arguments that are actually convincing?
Three things come to mind.
1) A single cache for pages
2) Ability to give access hints to the kernel regarding future IO
3) On the fly memory use for a cache. There would be no need to preallocate slabs of shared memory on startup.
And a more minor point would be:
4) Not having shared pages get lost when the backend dies (mmap(2) uses refcounts and cleans itself up, no need for ipcs/ipcrm/ipcclean). This isn't too practical in production though, but it sucks doing PostgreSQL development on OS-X because there is no ipcs/ipcrm command.
What I just read was a proposal to essentially throw away not only the entire
low-level data access model, but the entire low-level locking model,
and start from scratch.
From the above list, steps 2, 3, 5, 6, and 7 would be different than our current approach, all of which could be safely handled with some #ifdef's on platforms that don't have mmap(2).
There is no possible way we could support both this approach and the current one, which means that we'd be permanently dropping support for all platforms without high-quality mmap implementations;
Architecturally, I don't see any differences or incompatibilities that can't be solved with an #ifdef USE_MMAP/#else/#endif.
Furthermore, you didn't give any really convincing reasons to think that the enormous effort involved would be repaid.
Stevens has a great reimplementation of cat(1) that uses mmap(2) and benchmarks the two. I did my own version of that here:
When read(2)'ing/write(2)'ing /etc/services 100,000 times without mmap(2), it takes 82 seconds. With mmap(2), it takes anywhere from 1.1 to 18 seconds. Worst case scenario with mmap(2) yields a speedup by a factor of four. Best case scenario... *shrug* something better than 4x. I doubt PostgreSQL would see 4x speedups in the IO department, but I do think it would be vastly greater than the 3% suggested.
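For anyone who wants to try this kind of comparison themselves, here is a minimal sketch in C of the two copy paths. It is not the program that produced the numbers above: copy_with_read() and copy_with_mmap() are hypothetical names, the 8192-byte buffer is an arbitrary choice, and the timing loop (for example, calling each function N times between two clock readings) is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst with a read(2)/write(2) loop. Returns bytes copied or -1. */
long
copy_with_read(const char *src, const char *dst)
{
    char    buf[8192];               /* arbitrary buffer size for the sketch */
    long    total = 0;
    ssize_t n = 0;
    int     in = open(src, O_RDONLY);
    int     out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    assert(src != NULL && dst != NULL);
    if (in < 0 || out < 0)
        total = -1;
    while (total >= 0 && (n = read(in, buf, sizeof buf)) > 0)
    {
        /* Data is copied kernel -> buf by read(), then buf -> kernel here. */
        if (write(out, buf, (size_t) n) != n)
            total = -1;
        else
            total += n;
    }
    if (n < 0)
        total = -1;
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    return total;
}

/* Copy src to dst by mmap(2)'ing src and write(2)'ing straight from the mapping. */
long
copy_with_mmap(const char *src, const char *dst)
{
    struct stat st;
    long        total = -1;
    char       *p;
    int         in = open(src, O_RDONLY);
    int         out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    assert(src != NULL && dst != NULL);
    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        goto done;
    total = 0;
    if (st.st_size == 0)             /* mmap() rejects zero-length mappings */
        goto done;
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, in, 0);
    if (p == MAP_FAILED)
    {
        total = -1;
        goto done;
    }
    while (total < (long) st.st_size)
    {
        ssize_t n = write(out, p + total, (size_t) (st.st_size - total));

        if (n <= 0)
        {
            total = -1;
            break;
        }
        total += n;
    }
    munmap(p, (size_t) st.st_size);
done:
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    return total;
}
```

Both functions should produce identical output files. Any difference in wall-clock time will depend heavily on file size, the OS, and whether the file is already in the buffer cache.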
Those oprofile reports Josh just put up showed 3% of the CPU time going into userspace/kernelspace copying. Even assuming that that number consists entirely of reads and writes of shared buffers (and of course no other kernel call ever transfers any data across that boundary ;-)), there's no way we are going to buy into this sort of project in hopes of a 3% win.
Would it be helpful if I created a test program that demonstrated the execution path for writing mmap(2)'ed pages as outlined above?
-- Sean Chittenden