On Dec 5, 2008, at 7:50 PM, Andrew Gierth wrote:
While waiting for a large restore to complete (and investigating why
parts of it were so slow), I came across this scenario. This isn't
quite the same as some previous discussion of hint bits, but I thought
it was something that could probably be taken into account in future.
This may also be relevant to the tuplestore discussion.

The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB

The scenario: pg_restore of a dump containing a large partitioned table
(a dozen partitions of ~10GB each). The actual loading of the data
proceeds as expected; the interesting part is the creation of indexes
afterwards.

Watching the backend's progress, a large proportion of the time turns
out to be taken up by the heap scan that retrieves the data. The
problem is, of course, that the backend settles down into an access
pattern like this:

  lseek(0x64,0x3ef7c000,SEEK_SET)
  read(0x64,0x864123340,0x2000) => 8192/0x2000
  lseek(0x64,0x3ef3e000,SEEK_SET)
  write(0x64,0x864125340,0x2000) => 8192/0x2000

where fd 0x64 is the table heap file. The read is obvious; the write
happens because the backend, wanting to reuse a buffer, must first
write back a page that was dirtied earlier by hint-bit setting. Notice
that the write is issued by the same backend, on the same fd.
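
To make the pattern concrete, here is a minimal user-space sketch (not
PostgreSQL source; the file name, offsets, and 31-block window are made
up, and it should only ever be pointed at a scratch file, since the
data it writes back is garbage): one process alternating a sequential
read stream with writebacks to slightly earlier offsets, all through
the same fd.

  /* Sketch of the traced I/O pattern: sequential reads interleaved
   * with writebacks ~31 blocks behind, on one file descriptor.
   * Run against a scratch file only. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define BLCKSZ 8192

  int main(void)
  {
      char page[BLCKSZ];
      int fd = open("scratchfile", O_RDWR);  /* stand-in for a heap segment */
      if (fd < 0) { perror("open"); return 1; }

      for (off_t blk = 32; blk < 1024; blk++)
      {
          /* The sequential scan: read the next heap page. */
          lseek(fd, blk * BLCKSZ, SEEK_SET);
          if (read(fd, page, BLCKSZ) != BLCKSZ)
              break;

          /* Buffer replacement: the evicted buffer is dirty (hint
           * bits were set as the scan visited it), so it is written
           * back a little behind the read position, through the same
           * fd.  The file position jumps backwards on every pass. */
          lseek(fd, (blk - 31) * BLCKSZ, SEEK_SET);
          write(fd, page, BLCKSZ);
      }
      close(fd);
      return 0;
  }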

At least on unpatched FreeBSD this access pattern defeats OS-level
readahead, though lower-level readahead on the disk drives themselves
masks the effect to a significant extent (each read() call forces a
SCSI transaction, but that transaction completes quite quickly thanks
to read caching on the drive).
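
To see why the pattern defeats detection, consider a toy model (my
assumption about the general shape of such heuristics, not the actual
FreeBSD vfs code) in which a single per-file "next expected offset" is
shared by reads and writes. Replaying the traced pattern against it,
neither stream ever looks sequential:

  /* Toy model of a sequential-access heuristic that keeps a single
   * expected offset per open file, shared by reads and writes. */
  #include <stdio.h>
  #include <sys/types.h>

  struct seqstate {
      off_t next_off;   /* offset at which we expect the next access */
      int   seqcount;   /* how sequential the stream has looked */
  };

  static void note_access(struct seqstate *s, off_t off, size_t len)
  {
      if (off == s->next_off)
          s->seqcount++;        /* looks sequential: readahead grows */
      else
          s->seqcount = 0;      /* looks random: readahead disabled */
      s->next_off = off + len;
  }

  int main(void)
  {
      struct seqstate s = { 0, 0 };

      /* Replay the trace: read block N, then write block N-31. */
      for (off_t blk = 32; blk < 40; blk++)
      {
          note_access(&s, blk * 8192, 8192);          /* the read */
          note_access(&s, (blk - 31) * 8192, 8192);   /* the write */
          printf("block %ld: seqcount=%d\n", (long) blk, s.seqcount);
      }
      return 0;
  }

seqcount stays pinned at zero, so readahead never ramps up, even
though each stream taken on its own is perfectly sequential.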

In order to test how bad the effect was, I patched FreeBSD to use
separate sequential-behaviour tracking for reads and writes (this
patch turns out to be trivial, affecting only a couple of dozen
lines). The effect was fairly dramatic; the total time taken for
CREATE INDEX was cut by a factor of slightly better than 2 (typically
from ~700 seconds per partition to ~320 seconds on my data).

[for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]
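
I haven't reproduced the patch here; my reading of the description
above is that it amounts to splitting the shared state in two, roughly
like this (a sketch of the idea only, not the patch itself):

  /* Same toy model, but with independent sequential-detection state
   * for reads and writes, so each stream is judged on its own. */
  #include <stdio.h>
  #include <sys/types.h>

  struct seqstate { off_t next_off; int seqcount; };

  struct filestate {
      struct seqstate rd;   /* tracking for the read stream */
      struct seqstate wr;   /* tracking for the write stream */
  };

  static void note_access(struct seqstate *s, off_t off, size_t len)
  {
      s->seqcount = (off == s->next_off) ? s->seqcount + 1 : 0;
      s->next_off = off + len;
  }

  int main(void)
  {
      struct filestate f = { { 0, 0 }, { 0, 0 } };

      for (off_t blk = 32; blk < 40; blk++)
      {
          note_access(&f.rd, blk * 8192, 8192);        /* reads */
          note_access(&f.wr, (blk - 31) * 8192, 8192); /* writes */
      }
      /* Both streams now look fully sequential despite interleaving. */
      printf("read seqcount=%d, write seqcount=%d\n",
             f.rd.seqcount, f.wr.seqcount);
      return 0;
  }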

The obvious question is whether this is something which should be left
as the OS's problem, or whether it would be worth having pg do some
special handling of file opens to distinguish read accesses from write
accesses, or sequential accesses from random ones, when both are
likely to be happening at the same time. I've so far had conflicting
answers about how well Linux handles this case (and, not being a Linux
user, I have no easy way to test it myself).
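
One shape such special handling could take (a sketch of the idea only,
not something pg does today, and resting on the assumption that the
kernel tracks sequential behaviour per open file, as FreeBSD does) is
to route the scan's reads and the writebacks through separate file
descriptors, optionally hinting the read stream with posix_fadvise()
where it's available:

  /* Sketch: separate fds for the read stream and the writeback
   * stream of the same (hypothetical) heap segment, so each open
   * file carries its own sequential-detection state. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  int main(void)
  {
      char page[8192];
      int rfd = open("scratchfile", O_RDONLY);  /* read stream */
      int wfd = open("scratchfile", O_WRONLY);  /* writeback stream */
      if (rfd < 0 || wfd < 0) { perror("open"); return 1; }

  #ifdef POSIX_FADV_SEQUENTIAL
      /* Where supported, also tell the kernel outright that the
       * read stream is sequential. */
      posix_fadvise(rfd, 0, 0, POSIX_FADV_SEQUENTIAL);
  #endif

      for (off_t blk = 32; blk < 1024; blk++)
      {
          lseek(rfd, blk * 8192, SEEK_SET);
          if (read(rfd, page, sizeof page) != (ssize_t) sizeof page)
              break;
          lseek(wfd, (blk - 31) * 8192, SEEK_SET);
          write(wfd, page, sizeof page);
      }
      close(rfd);
      close(wfd);
      return 0;
  }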

We don't do restores very often, but we have noticed that recovery
mode is painfully slow for us, whether recovering from a crash or
bringing up a PITR snapshot. We're running 16-core IA64 machines with
96GB of RAM, hitting iSCSI SANs (some SATA, some SAS). Under ideal
conditions, PG can read or write at 100+MB/s, and typically we push
~5MB/s, but during recovery we'll only do 600-700kB/s. I've never
straced a backend to see exactly what's going on.
--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828


