While waiting for a large restore to complete (and investigating why parts of it were so slow), I came across this scenario. This isn't quite the same as some previous discussion of hint bits, but I thought it was something that could probably be taken into account in future. This also may be relevent to the tuplestore discussion.
The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM; 16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB The scenario: pg_restore of a dump containing a large partitioned table (a dozen partitions of ~10GB each). The actual loading of the data proceeds as expected, the interesting part is the creation of indexes afterwards. Watching the progress of the backend, a large proportion of the time is taken up by the heap scan to retrieve the data. The problem is, of course, that the backend settles down into an access pattern like this: lseek(0x64,0x3ef7c000,SEEK_SET) read(0x64,0x864123340,0x2000) => 8192/0x2000 lseek(0x64,0x3ef3e000,SEEK_SET) write(0x64,0x864125340,0x2000) => 8192/0x2000 where fd 0x64 is the table heap file; the read is obvious, the write is caused by writing a previously hinted page back to disk when the backend wants to reuse the buffer. Notice that this write is happening in the same backend (and on the same fd). At least on unpatched FreeBSD this access pattern destroys OS-level readahead, though lower-level readahead on the actual disk drives themselves hides this fact to a significant extent (each read() call forces a SCSI transaction, but this transaction completes quite quickly due to read caching on the drive). In order to test how bad the effect was, I patched FreeBSD to use separate sequential-behaviour tracking for reads and writes (this patch turns out to be trivial, affecting only a couple of dozen lines). The effect was fairly dramatic; the total time taken for CREATE INDEX was cut by a factor of slightly better than 2 (typically from ~700 seconds per partition to ~320 seconds on my data). [for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ] The obvious question is whether this is something which should be left as the OS'es problem, or whether it would be worth having pg do some special handling of file opens to distinguish read and write accesses, or sequential from random accesses when both are likely to be happening at the same time. I've so far had conflicting answers about how well Linux handles this case (and not being a Linux user I have no easy way to test it myself). -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers