After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres.

I did a bunch of oprofile measurements on a machine belonging to one of Josh's clients, using a test case that involved heavy concurrent access to a relatively small amount of data (little enough to fit into Postgres shared buffers, so that no I/O or kernel calls were really needed once the test got going). I found that by nearly any measure --- elapsed time, bus transactions, or machine-clear events --- the spinlock acquisitions associated with grabbing and releasing the BufMgrLock took an unreasonable fraction of the time. I saw about 15% of elapsed time, 40% of bus transactions, and nearly 100% of pipeline-clear cycles going into what is essentially two instructions out of the entire backend. (Pipeline clears occur when the cache coherency logic detects a memory write ordering problem.)

I am not completely clear on why this machine-level bottleneck manifests as a lot of context swaps at the OS level. I think what is happening is that because SpinLockAcquire is so slow, a process is much more likely than you'd normally expect to arrive at SpinLockAcquire while another process is also acquiring the spinlock. This puts the two processes into a "lockstep" condition where the second process is nearly certain to observe the BufMgrLock as locked, and be forced to suspend itself, even though the time the first process holds the BufMgrLock is not really very long at all.

If you google for Xeon and "cache coherency" you'll find quite a bit of suggestive information about why this might be more true on the Xeon setup than others. A couple of interesting hits:

http://www.theinquirer.net/?article=10797
says that Xeon MP uses a *slower* FSB than Xeon DP. This would translate directly to more time needed to transfer a dirty cache line from one processor to the other, which is the basic operation that we're talking about here.

http://www.aceshardware.com/Spades/read.php?article_id=30000187
says that Opterons use a different cache coherency protocol that is fundamentally superior to the Xeon's, because dirty cache data can be transferred directly between two processor caches without waiting for main memory.

So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.)

In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be.

One thing we probably have got to do soon is break up the BufMgrLock into multiple finer-grain locks so that there will be less contention. However I am wary of doing this incautiously, because if we do it in a way that makes for a significant rise in the number of locks that have to be acquired to access a buffer, we might end up with a net loss.

I think Neil Conway was looking into how the bufmgr might be restructured to reduce lock contention, but if he had come up with anything he didn't mention exactly what. Neil?

			regards, tom lane
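
To make the finer-grain locking idea above a bit more concrete, here is a minimal sketch in C11. It is an illustration only, not a patch and not how the bufmgr is actually structured. Every name in it (PartitionLock, NUM_PARTITIONS, lock_for_buffer, and so on) is hypothetical, the 64-byte cache line size is an assumption, and C11 atomics stand in for the platform-specific test-and-set code. The idea is to replace one global spinlock with an array of spinlocks, each on its own cache line, and map a buffer number to one of them, so that two backends touching different buffers usually bounce different cache lines.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical; this is not PostgreSQL code. */

#define NUM_PARTITIONS 16          /* assumed to be a power of two */
#define CACHE_LINE_SIZE 64         /* assumed cache line size in bytes */

/* One test-and-set lock per partition, aligned so that no two locks
 * share a cache line (avoids false sharing between partitions). */
typedef struct
{
    _Alignas(CACHE_LINE_SIZE) atomic_flag locked;
} PartitionLock;

static PartitionLock partition_locks[NUM_PARTITIONS];

/* Put every lock into the unlocked state; call once before use. */
static inline void
init_partition_locks(void)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        atomic_flag_clear(&partition_locks[i].locked);
}

/* Map a buffer number to the lock that protects it. */
static inline PartitionLock *
lock_for_buffer(unsigned buf_id)
{
    return &partition_locks[buf_id & (NUM_PARTITIONS - 1)];
}

/* Single test-and-set attempt; returns true if we got the lock. */
static inline bool
try_lock(PartitionLock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is ours.  A real implementation would
 * back off or sleep after some number of failed attempts. */
static inline void
spin_lock(PartitionLock *lock)
{
    while (!try_lock(lock))
        ;
}

static inline void
spin_unlock(PartitionLock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

A caller would do spin_lock(lock_for_buffer(n)), touch buffer n, then spin_unlock on the same lock. The caveat from above still applies: this only wins if a typical buffer access needs one of these locks. If an operation ends up taking several partition locks where it used to take one BufMgrLock, each acquisition is still a potentially expensive cache line transfer and the total could easily come out worse.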