I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not "scan
resistant" as advertised.  A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and
"memtable" is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation               memtable        bigtable
---------------------------------------------------
SELECT COUNT(*)         1221 MB/s       973 MB/s
VACUUM                  1709 MB/s       1206 MB/s

We had observed that VACUUM would perform better when done right after a
SELECT.  In the above example, the faster rate from disk was 1608 MB/s,
compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well.  The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache.  Doug found that we weren't using the Postgres buffer cache the way
we expected, instead we were loading the scanned data from disk into the
cache even though there was no possibility of reusing it.  In addition to
pushing other, possibly useful pages from the cache, it has the additional
behavior of invalidating the L2 cache for the remainder of the executor path
that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the
shared buffer size to fit into L2 cache (1MB per Opteron core), and this is
what we saw (drop size of shared buffers to 680KB):

Operation               memtable        bigtable
---------------------------------------------------
SELECT COUNT(*)         1320 MB/s       1059 MB/s
VACUUM                  3033 MB/s       1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm?  Ideally,
one page would be used in the buffer cache in circumstances where the table
to be scanned is (significantly?) larger than the size of the buffer cache.
 
- Luke



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Reply via email to