I'm putting this out there before we publish a fix so that we can discuss how best to fix it.
Doug and Sherry recently found the source of an important performance issue with the Postgres shared buffer cache. The issue is summarized like this: the buffer cache in PGSQL is not "scan resistant" as advertised. A sequential scan of a table larger than cache will pollute the buffer cache in almost all circumstances. Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500 (thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and "memtable" is a table that fits into I/O cache: With our default setting of shared_buffers (16MB): Operation memtable bigtable --------------------------------------------------- SELECT COUNT(*) 1221 MB/s 973 MB/s VACUUM 1709 MB/s 1206 MB/s We had observed that VACUUM would perform better when done right after a SELECT. In the above example, the faster rate from disk was 1608 MB/s, compared to the normal rate of 1206 MB/s. We verified this behavior on Postgres 8.2 as well. The buffer selection algorithm is choosing buffer pages scattered throughout the buffer cache in almost all circumstances. Sherry traced the behavior to the processor repeatedly flushing the L2 cache. Doug found that we weren't using the Postgres buffer cache the way we expected, instead we were loading the scanned data from disk into the cache even though there was no possibility of reusing it. In addition to pushing other, possibly useful pages from the cache, it has the additional behavior of invalidating the L2 cache for the remainder of the executor path that uses the data. To prove that the buffer cache was the source of the problem, we dropped the shared buffer size to fit into L2 cache (1MB per Opteron core), and this is what we saw (drop size of shared buffers to 680KB): Operation memtable bigtable --------------------------------------------------- SELECT COUNT(*) 1320 MB/s 1059 MB/s VACUUM 3033 MB/s 1597 MB/s These results do not vary with the order of operations. Thoughts on the best way to fix the buffer selection algorithm? Ideally, one page would be used in the buffer cache in circumstances where the table to be scanned is (significantly?) larger than the size of the buffer cache. - Luke ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match