[HACKERS] Hint bits vs. OS readahead

Andrew Gierth Fri, 05 Dec 2008 18:14:15 -0800

While waiting for a large restore to complete (and investigating why
parts of it were so slow), I came across this scenario. This isn't
quite the same as some previous discussion of hint bits, but I thought
it was something that could probably be taken into account in future.
This also may be relevent to the tuplestore discussion.


The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB

The scenario: pg_restore of a dump containing a large partitioned table
(a dozen partitions of ~10GB each). The actual loading of the data
proceeds as expected, the interesting part is the creation of indexes
afterwards.

Watching the progress of the backend, a large proportion of the time
is taken up by the heap scan to retrieve the data. The problem is, of
course, that the backend settles down into an access pattern like this:

  lseek(0x64,0x3ef7c000,SEEK_SET)
  read(0x64,0x864123340,0x2000) => 8192/0x2000
  lseek(0x64,0x3ef3e000,SEEK_SET)
  write(0x64,0x864125340,0x2000) => 8192/0x2000

where fd 0x64 is the table heap file; the read is obvious, the write is
caused by writing a previously hinted page back to disk when the backend
wants to reuse the buffer. Notice that this write is happening in the
same backend (and on the same fd).

At least on unpatched FreeBSD this access pattern destroys OS-level
readahead, though lower-level readahead on the actual disk drives
themselves hides this fact to a significant extent (each read() call
forces a SCSI transaction, but this transaction completes quite
quickly due to read caching on the drive).

In order to test how bad the effect was, I patched FreeBSD to use
separate sequential-behaviour tracking for reads and writes (this
patch turns out to be trivial, affecting only a couple of dozen
lines). The effect was fairly dramatic; the total time taken for
CREATE INDEX was cut by a factor of slightly better than 2 (typically
from ~700 seconds per partition to ~320 seconds on my data).

[for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]

The obvious question is whether this is something which should be left
as the OS'es problem, or whether it would be worth having pg do some
special handling of file opens to distinguish read and write accesses,
or sequential from random accesses when both are likely to be happening
at the same time. I've so far had conflicting answers about how well
Linux handles this case (and not being a Linux user I have no easy way
to test it myself).

-- 
Andrew (irc:RhodiumToad)

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Hint bits vs. OS readahead

Reply via email to