pg to my mind is unique in not trying to avoid OS buffering. Other dbmses spend substantial effort creating a virtual OS (task management, I/O drivers, etc.) both in code and support. Choosing mmap seems such a limiting option - it adds OS dependency and limits kernel developer options (2G limits, global mlock serializations, porting problems, inability to schedule or parallelize I/O, still having to coordinate writers and readers).
2G limits? That must be a Linux limitation, not a limitation with mmap(2). On OS-X and FreeBSD it's anywhere from 4GB to ... well, whatever the 64bit limit is (which is bigger than any data file in $PGDATA). An mlock(2) serialization problem is going to be cheaper than hitting the disk in nearly all cases and should be no worse than a context switch or semaphore (what we use for the current locking scheme), of which PostgreSQL causes plenty of 'em because it's multi-process, not multi-threaded. Coordination of data isn't necessary if you mmap(2) data as a private block, which takes a snapshot of the page at the time you make the mmap(2) call and gets copied only when the page is written to. More on that later.
I'm not sure I entirely agree with this. Whether you access a file via mmap() or via read(), the end result is that you still have to access it, and since PG has significant chunks of system-dependent code that it heavily relies on as it is (e.g., locking mechanisms, shared memory), writing the I/O subsystem in a similar way doesn't seem to me to be that much of a stretch (especially since PG already has the storage manager), though it might involve quite a bit of work.
Obviously you have to access the file on the hard drive, but you're forgetting an enormous advantage of mmap(2). With a read(2) system call, the program has to allocate space for the read(2), then it copies data from the kernel into the allocated memory in the userland's newly allocated memory location. With mmap(2) there is no second copy.
Let's look at what happens with a read(2) call. To read(2) data you have to have a block of memory to copy data into. Assume your OS of choice has a good malloc(3) implementation and it only needs to call brk(2) once to extend the process's memory address after the first malloc(3) call. There's your first system call, which guarantees one context switch. The second hit, a much larger hit, is the actual read(2) call itself, wherein the kernel has to copy the data twice: once into a kernel buffer, then from the kernel buffer into the userland's memory space. Yuk. Web servers figured out long ago that read(2) is slow and evil in terms of performance. Apache uses mmap(2) to send static files at performance levels that don't suck and is actually quite fast (in terms of responsiveness, I'm not talking about Apache's parallelism/concurrency performance levels... which in 1.X aren't great).
mmap(2) is a totally different animal in that you don't ever need to make calls to read(2): mmap(2) is used in place of those calls (With #ifdef and a good abstraction, the rest of PostgreSQL wouldn't know it was working with a page of mmap(2)'ed data or need to know that it is). Instead you mmap(2) a file descriptor and the kernel does some heavy lifting/optimized magic in its VM. The kernel reads the file descriptor and places the data it reads into its buffer (exactly the same as what happens with read(2)), but, instead of copying the data to the userspace, mmap(2) adjusts the process's address space and maps the address of the kernel buffer into the process's address space. No copying necessary. The savings here are *huge*!
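To make the two access paths concrete, here is a minimal sketch in Python, whose mmap module is a thin wrapper around mmap(2). The function names and the scratch file are made up for illustration and have nothing to do with PostgreSQL's storage manager.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size of this machine, typically 4096


def read_with_read(path, offset, length):
    """read(2)-style access: the kernel copies the bytes into a new userland buffer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def read_with_mmap(path, offset, length):
    """mmap(2)-style access: the file's pages are mapped into our address space."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
            # Slicing copies into a bytes object for the caller's convenience;
            # a memoryview(m) would expose the mapped pages without that copy.
            return m[offset:offset + length]
    finally:
        os.close(fd)


def make_demo_file(pages=4):
    """Create a scratch file in which page i is filled with the byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(pages):
            f.write(bytes([i]) * PAGE)
    return path
```

Both functions return the same bytes for the same offset; the difference is only in how the data reaches the process. With a good abstraction, callers would not need to know which one was used.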
Depending on the mmap(2) implementation, the VM may not even get a page from disk until it's actually needed. So, let's say you mmap(2) a 16M file. The address space picks up an extra 16M of bits that the process *can* use, but doesn't necessarily use. So if a user reads only ten pages out of a 16MB file, only ten pages (10 * getpagesize(), usually 40,960 bytes, or 40K) get read from disk, which is 0.24% of the disk access needed to read the whole file (((4096 * 10) / (16 * 1024 * 1024)) * 100). Did I forget to mention that if the file is already in the kernel's buffers, there's no need for the kernel to access the hard drive? Another big win for data that's hot/frequently accessed.
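The arithmetic for the ten-page case, worked through as a quick check. The 4096-byte page size is an assumption; getpagesize() may report something else on your platform.

```python
PAGE_SIZE = 4096                 # assumed page size; the post says "usually"
FILE_SIZE = 16 * 1024 * 1024     # the 16MB file from the example
PAGES_TOUCHED = 10

bytes_faulted_in = PAGES_TOUCHED * PAGE_SIZE
percent_of_file = bytes_faulted_in / FILE_SIZE * 100

print(bytes_faulted_in)            # 40960 bytes, i.e. 40K
print(round(percent_of_file, 2))   # 0.24
```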
There's another large savings if the machine is doing network IO too...
As for parallelization of I/O, the use of mmap() for reads should significantly improve parallelization -- now instead of issuing read() system calls, possibly for the same set of blocks, all the backends would essentially be examining the same data directly. The performance improvements as a result of accessing the kernel's cache pages directly instead of having it do buffer copies to process-local memory should increase as concurrency goes up. But see below.
That's kinda true... though not quite correct. The improvement in IO concurrency comes from zero-socket-copy operations from the disk to the network controller. If a write(2) system call is issued on a page of mmap(2)'ed data (and your operating system supports it, I know FreeBSD does, but don't think Linux does), then the page of data is DMA'ed by the network controller and sent out without the data needing to be copied into the network controller's buffer. So, instead of the CPU copying data from the OS's buffer to a kernel buffer, the network card grabs the chunk of data in one interrupt because of the DMA (direct memory access). This is a pretty big deal for web serving, but if you've got a database sending large sets of data over the network, assuming the network isn't the bottleneck, this results in a hefty performance boost (that won't be noticed by most until they're running huge, very busy installations). This optimization comes for free and without needing to add one line of code to an application once mmap(2) has been added to an application.
More to the point, I think it is very hard to effectively coordinate multithreaded I/O, and mmap seems used mostly to manage relatively simple scenarios.
PG already manages and coordinates multithreaded I/O. The mechanisms used to coordinate writes needn't change at all. But the way reads are done relative to writes might have to be rethought, since an mmap()ed buffer always reflects what's actually in kernel space at the time the buffer is accessed, while a buffer retrieved via read() reflects the state of the file at the time of the read(). If it's necessary for the state of the buffers to be fixed at examination time, then mmap() will be at best a draw, not a win.
Here's where things can get interesting from a transaction standpoint. Your statement is correct up until you make the assertion that a page needs to be fixed. If you're doing a read(2) transaction, mmap(2) a region and set the MAP_PRIVATE flag so the ground won't change underneath you. No copying of this page is done by the kernel unless it gets written to. If you're doing a write(2) or are directly scribbling on an mmap(2)'ed page, you need to grab some kind of an exclusive lock on the page/file (mlock(2) is going to be no more expensive than a semaphore, but probably less expensive). We already do that with semaphores, however. So for databases that don't have high contention for the same page/file of data, there are no additional copies made. When a piece of data is written, a page is duplicated before it gets scribbled on, but the application never knows this happens. The next time a process mmap(2)'s a region of memory that's been written to, it'll get the updated data without any need to flush a cache or mark pages as dirty: the operating system does all of this for us (and probably faster too). mmap(2) implementations are, IMHO, more optimized than shared memory implementations (mmap(2) is a VM function, which gets many eyes to look it over and is always being tuned, whereas shared mem is a bastardized subsystem that works, but isn't integral to any performance areas in the kernel so it gets neglected. Just my observations from the *BSD commit lists. On Linux it may be different).
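A minimal sketch of the copy-on-write half of that, using Python's mmap.ACCESS_COPY, which maps the file MAP_PRIVATE. The scratch file and helper names are invented for the example. Note that POSIX leaves it unspecified whether a private mapping sees later changes made to the file by others, so this only shows that scribbling on a private page never reaches the file.

```python
import mmap
import os
import tempfile


def private_scribble(path, data):
    """Overwrite the start of a MAP_PRIVATE mapping and return what the mapping sees.

    mmap.ACCESS_COPY is Python's name for a private, copy-on-write mapping.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
            m[:len(data)] = data          # the kernel copies the touched page first
            return m[:len(data)]


def file_prefix(path, n):
    """Read the first n bytes straight from the file."""
    with open(path, "rb") as f:
        return f.read(n)


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"original" + b"\0" * 4088)

seen_by_mapping = private_scribble(path, b"SCRIBBLE")   # b"SCRIBBLE"
seen_in_file = file_prefix(path, 8)                     # still b"original"
os.remove(path)
```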
 I forgot to mention earlier, you don't have to write(2) data to a file if it's mmap(2)'ed, you can change the contents of an mmap(2)'ed region, then msync(2) it back to disk (to ensure it gets written out) or let the last munmap(2) call do that for you (which would be just as dangerous as running without fsync... but would result in an additional performance boost).
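Roughly what that looks like, again as a Python sketch (mmap.flush() calls msync(2) on Unix). The file layout and the update_in_place name are made up for the example.

```python
import mmap
import os
import tempfile


def update_in_place(path, offset, data):
    """Change a shared mapping directly, then msync(2) it back to disk."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            m[offset:offset + len(data)] = data   # no write(2) call involved
            m.flush()                             # msync(2) on Unix


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"." * mmap.PAGESIZE)

update_in_place(path, 16, b"tuple")
with open(path, "rb") as f:
    on_disk = f.read()[16:21]     # b"tuple"
os.remove(path)
```

Dropping the flush() call is the "let munmap(2) do it" variant: the data still lands in the file eventually, but you have no say over when.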
mmap doesn't look that promising.
This ultimately depends on two things: how much time is spent copying buffers around in kernel memory, and how much advantage can be gained by freeing up the memory used by the backends to store the backend-local copies of the disk pages they use (and thus making that memory available to the kernel to use for additional disk buffering).
Someone on IRC pointed me to some OSDL benchmarks, which broke down where time is being spent. Want to know what the most expensive part of PostgreSQL is? *drum roll*
3967393 total                  1.7735
2331284 default_idle           36426.3125
 825716 do_sigaction           1290.1813
 133126 __copy_from_user_ll    1040.0469
  97780 __copy_to_user_ll      763.9062
  43135 finish_task_switch     269.5938
  30973 do_anonymous_page      62.4456
  24175 scsi_request_fn        22.2197
  23355 __do_softirq           121.6406
  17039 __wake_up              133.1172
  16527 __make_request         10.8730
   9823 try_to_wake_up         13.6431
   9525 generic_unplug_device  66.1458
   8799 find_get_page          78.5625
   7878 scsi_end_request       30.7734
Copying data to/from userspace and signal handling!!!! Let's hear it for the need for mmap(2)!!! *crowd goes wild*
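For a rough sense of scale, the shares can be worked out from the numbers above. This assumes the first column is a sample count and that idle time should be left out of the denominator; both are my reading of the output, not something the benchmark states.

```python
# Tick counts copied from the profile above (first numeric column).
ticks = {
    "total": 3967393,
    "default_idle": 2331284,
    "do_sigaction": 825716,
    "__copy_from_user_ll": 133126,
    "__copy_to_user_ll": 97780,
}

busy = ticks["total"] - ticks["default_idle"]          # ticks spent doing work
copy_ticks = ticks["__copy_from_user_ll"] + ticks["__copy_to_user_ll"]

signal_share = ticks["do_sigaction"] / busy * 100
copy_share = copy_ticks / busy * 100

print(f"busy ticks:         {busy}")
print(f"signal handling:    {signal_share:.1f}% of busy ticks")
print(f"user/kernel copies: {copy_share:.1f}% of busy ticks")
```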
The gains from the former are likely small. The gains from the latter are probably also small, but harder to estimate.
The use of mmap() is probably one of those optimizations that should be done when there's little else left to optimize, because the potential gains are possibly (if not probably) relatively small and the amount of work involved may be quite large.
If system/kernel time is where most of your database spends its time, then mmap(2) is a huge optimization that is very much worth pursuing. It's stable (nearly all webservers use it, notably Apache), widely deployed, POSIX specified (granted not all implementations are 100% consistent, but that's an OS bug and mmap(2) doesn't have to be turned on for those platforms: it's no worse than where we are now), and well optimized by operating system hackers. I guarantee that your operating system of choice has a faster VM and disk cache than PostgreSQL's userland cache, never mind that using the OS's buffers leads to many performance boosts as the OS can short-circuit common pathways that would require data copying (ex: zero-socket-copy operations and copying data to/from userland).
mmap(2) isn't a panacea or replacement for good software design, but it certainly does make IO operations vastly faster, and IO is what PostgreSQL does a lot of (hence its need for a userland cache). Remember, back when PostgreSQL had its architecture thunk up, mmap(2) hardly existed in anyone's eyes, never mind it being widely used or a POSIX function. It wasn't until Apache started using it that operating system vendors felt the need to implement it or make it work well. Now it's integral to nearly all virtual memory implementations and a modern OS can't live without it or have it broken in any way. It would be largely beneficial to PostgreSQL to heavily utilize mmap(2).
A few places it should be used include:
*) Storage. It is a good idea to mmap(2) all files instead of read(2)'ing files. mmap(2) doesn't fetch a page from disk until it's actually needed, which is a nifty savings. Sure it causes a fault in the kernel, but it won't the second time that page is accessed. Changes are necessary to src/backend/storage/file/ and possibly src/backend/storage/freespace/ (why is it using fread(3) and not read(2)?). src/backend/storage/large_object/ can remain gimpy since people should use BYTEA instead (IMHO), src/backend/storage/page/ doesn't need changes (I don't think), and src/backend/storage/smgr/ shouldn't need any modifications either.
*) ARC. Why munmap(2) data if you don't need to? With ARC, it's possible for the database to coach the operating system in what pages should be persistent. ARC's a smart algorithm for handling the needs of a database. Instead of having a cache of pages in userland, PostgreSQL would have a cache of mmap(2)'ed pages. It's shared between processes, the changes are public to external programs read(2)'ing data, and it's quick. The need for shared memory from the kernel drops to nearly nothing. The need for mmap(2)'able space in the kernel, however, does go up. Unlike SysV shared mem, this can normally be changed on the fly. The end result would be, if a page is needed, it checks to see if it's in the cache. If it is, the mmap(2)'ed page is returned. If it isn't, the page gets read(2)/mmap(2) like it currently is loaded (except in the mmap(2) case where after the data has been loaded, the page gets munmap(2)'ed). If ARC decides to keep the page, the page doesn't get munmap(2)'ed. I don't think any changes need to be made though to take advantage of mmap(2) if the changes are made in the places mentioned above in the Storage point.
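A toy sketch of the "cache of mmap(2)'ed pages" idea from the ARC point. Plain LRU stands in for ARC here, and MappedPageCache, its block size and its counters are all invented for illustration; none of this is PostgreSQL code.

```python
import mmap
import os
import tempfile
from collections import OrderedDict

BLOCK = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be a multiple of this


class MappedPageCache:
    """Keep up to `capacity` blocks mapped; munmap the least recently used.

    Plain LRU is a stand-in here; ARC's replacement policy is not implemented.
    """

    def __init__(self, fd, capacity):
        self.fd = fd
        self.capacity = capacity
        self.pages = OrderedDict()   # block number -> mmap object
        self.hits = 0
        self.misses = 0

    def get(self, blockno):
        if blockno in self.pages:
            self.hits += 1
            self.pages.move_to_end(blockno)      # mark as most recently used
            return self.pages[blockno]
        self.misses += 1
        page = mmap.mmap(self.fd, BLOCK, access=mmap.ACCESS_READ,
                         offset=blockno * BLOCK)
        self.pages[blockno] = page
        if len(self.pages) > self.capacity:
            _, victim = self.pages.popitem(last=False)
            victim.close()                       # munmap(2) the evicted block
        return page

    def close(self):
        for page in self.pages.values():
            page.close()
        self.pages.clear()


def make_block_file(blocks):
    """Scratch file in which block i is filled with the byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(blocks):
            f.write(bytes([i]) * BLOCK)
    return path
```

A hit hands back the existing mapping with no system call at all; only a miss pays for mmap(2), and only an eviction pays for munmap(2).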
A few other perks:
*) DIRECTIO can be used without much of a cache coherency headache since the cache of data is in the kernel, not userland.
*) NFS. I'm not suggesting multiple clients use the same data directory via NFS (unless read only), but if there were a single client accessing a data directory over NFS, performance would be much better than it is today because data consistency is handled by the kernel so in flight packets for writes that get dropped or lost won't cause a slow down (mmap(2) behaves differently with NFS pages) or corruption.
*) mmap(2) is conditional on the operating system's abilities, but doesn't require any architectural changes. It does change the location of the cache, from being in the userland, down in to the kernel. This is a change for database administrators, but a good one, IMHO. Previously, the operating system would be split 25% kernel, 75% user because PostgreSQL would need the available RAM for its cache. Now, that can be moved closer to the opposite, 75% kernel, 25% user because most of the memory is mmap(2)'ed pages instead of actual memory in the userland.
*) Pages can be protected via PROT_(EXEC|READ|WRITE). For backends that aren't making changes to the DDL or system catalogs (permissions, etc.), pages that are loaded from the catalogs could be loaded with the protection PROT_READ, which would prevent changes to the catalogs. All DDL and permission altering commands (anything that touches the system catalogs) would then load the page with the PROT_WRITE bit set, make their changes, then PROT_READ the page again. This would provide a first line of defense against buggy programs or exploits.
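A small sketch of the protection idea. Python has no wrapper for mprotect(2), so instead of flipping the bits on an existing mapping this maps the same page twice, once PROT_READ and once PROT_READ|PROT_WRITE. Python also enforces the read-only case itself by raising TypeError, where a C backend would take a SIGSEGV. The "catalog" page and helper names are made up.

```python
import mmap
import os
import tempfile


def map_page(fd, writable):
    """Map one page with PROT_READ, adding PROT_WRITE only when asked."""
    prot = mmap.PROT_READ | (mmap.PROT_WRITE if writable else 0)
    return mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED, prot=prot)


def try_write(page, data):
    """Return True if the write went through, False if the mapping refused it."""
    try:
        page[:len(data)] = data
        return True
    except TypeError:      # Python raises TypeError on a read-only mapping
        return False


fd, path = tempfile.mkstemp()
os.write(fd, b"catalog!" + b"\0" * (mmap.PAGESIZE - 8))

ordinary = map_page(fd, writable=False)            # a backend that only reads
read_only_write = try_write(ordinary, b"CLOBBER!")

ddl = map_page(fd, writable=True)                  # a backend running DDL
ddl_write = try_write(ddl, b"ALTERED!")
visible_to_reader = ordinary[:8]   # shared mapping, so the reader sees the change

ordinary.close()
ddl.close()
os.close(fd)
os.remove(path)
```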
*) Eliminates the double caching done currently (caching in PostgreSQL and the kernel) by pushing the cache into the kernel... but without PostgreSQL knowing it's working on a page that's in the kernel.
Please ask questions if you have them.
-- Sean Chittenden