On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <da...@fromorbit.com> wrote:
> But there's something here that I'm not getting - you're talking
> about a data set that you want to keep cache resident that is at
> least an order of magnitude larger than the cyclic 5-15 minute WAL
> dataset that ongoing operations need to manage to avoid IO storms.
> Where do these temporary files fit into this picture, how fast do
> they grow and why do they need to be so large in comparison to
> the ongoing modifications being made to the database?

I'm not sure you've got that quite right.  WAL is fsync'd very
frequently - on every commit, at the very least, and multiple times
per second even if there are no commits going on, just to make sure we get
it all down to the platter as fast as possible.  The thing that causes
the I/O storm is the data file writes, which are performed either when
we need to free up space in PostgreSQL's internal buffer pool (aka
shared_buffers) or once per checkpoint interval (5-60 minutes) in any
event.  The point of this system is that if we crash, we're going to
need to replay all of the WAL to recover the data files to the proper
state; but we don't want to keep WAL around forever, so we checkpoint
periodically.  By writing all the data back to the underlying data
files, checkpoints render older WAL segments irrelevant, at which
point we can recycle those files before the disk fills up.
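
To make the relationship between commits, checkpoints, and WAL
recycling concrete, here is a rough toy model in Python.  It is only a
sketch under heavy assumptions: the class and method names are made up
for illustration, a list stands in for the WAL segments, and dicts
stand in for shared_buffers and the data files.  It is not how
PostgreSQL is implemented.

```python
class ToyDatabase:
    """Toy model: WAL is appended on every commit; data files are
    only written when a checkpoint flushes the dirty buffers."""

    def __init__(self):
        self.wal = []          # (lsn, key, value) records; stands in for WAL segments
        self.buffer_pool = {}  # dirty in-memory pages; stands in for shared_buffers
        self.data_files = {}   # contents of the on-disk data files
        self.next_lsn = 0

    def commit(self, key, value):
        # The WAL record goes down first (a real system would fsync here).
        self.wal.append((self.next_lsn, key, value))
        self.next_lsn += 1
        # The data file write is deferred; only the buffer is dirtied.
        self.buffer_pool[key] = value

    def checkpoint(self):
        # Write every dirty buffer back to the data files.  This is the
        # burst of data file I/O that a checkpoint causes.
        self.data_files.update(self.buffer_pool)
        self.buffer_pool.clear()
        # Older WAL is now irrelevant for recovery and can be recycled.
        recycled = len(self.wal)
        self.wal.clear()
        return recycled

    def state_after_crash(self):
        # A crash loses the buffer pool.  Recovery starts from the data
        # files and replays whatever WAL has not been recycled.
        recovered = dict(self.data_files)
        for _lsn, key, value in self.wal:
            recovered[key] = value
        return recovered
```

In this model, a crash before the first checkpoint is recovered
entirely from WAL, and after a checkpoint only the records written
since then need to be replayed.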

Temp files are something else again.  If PostgreSQL needs to sort a
small amount of data, like a kilobyte, it'll use quicksort.  But if it
needs to sort a large amount of data, like a terabyte, it'll use a
merge sort.[1]  The reason is of course that quicksort requires random
access to work well; if parts of quicksort's working memory get paged
out during the sort, your life sucks.  Merge sort (or at least our
implementation of it) is slower overall, but it only accesses the data
sequentially.  When we do a merge sort, we use files to simulate the
tapes that Knuth had in mind when he wrote down the algorithm.  If the
OS runs short of memory - because the sort is really big or just
because of other memory pressure - it can page out the parts of the
file we're not actively using without totally destroying performance.
It'll be slow, of course, because disks always are, but not like
quicksort would be if it started swapping.
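
For anyone who hasn't seen the technique, the following is a minimal
sketch of an external merge sort that uses temporary files as the
"tapes".  It is a simplified illustration in Python, not PostgreSQL's
actual sort code: the function name and the run_size parameter are
assumptions for the example, and it sorts plain integers stored one
per line.

```python
import heapq
import tempfile


def external_sort(values, run_size):
    """Yield the integers in `values` in sorted order, holding at most
    `run_size` of them in memory while building the sorted runs."""
    runs = []  # one temp file per sorted run (the "tapes")
    buf = []

    def flush():
        # Sort one memory-sized chunk and write it out sequentially.
        buf.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in buf)
        f.seek(0)  # rewind so the merge can read from the start
        runs.append(f)
        buf.clear()

    for v in values:
        buf.append(v)
        if len(buf) >= run_size:
            flush()
    if buf:
        flush()

    def read_run(f):
        # Each run is read strictly front to back.
        for line in f:
            yield int(line)

    try:
        # heapq.merge lazily merges the already-sorted runs.
        yield from heapq.merge(*(read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()
```

For example, list(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
writes three runs and merges them.  Every file is written once and read
once, in order, which is why the OS can evict the parts not currently
in use without the access pattern falling apart.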

I haven't actually experienced (or heard of) the problem Jeff Janes
describes, where temp files get written out to disk too
aggressively; as I said before, the problems I've seen usually go
the other way - stuff not getting written out aggressively enough.
But it sounds plausible.  The OS only lets you set one policy, and if
you make that policy right for permanent data files that get
checkpointed it could well be wrong for temp files that get thrown
out.  Just stuffing the data on RAMFS will work for some
installations, but might not be good if you actually do want to
perform sorts whose size exceeds RAM.

BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
in attending Collab on behalf of the PostgreSQL community.  Although
the prospect of a cross-country flight is a somewhat depressing
thought, it does sound pretty cool, so I'm potentially interested.  I
have no idea what the procedure is here for moving forward though,
especially since it sounds like there might be only one seat available
and I don't know who else may wish to sit in it.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] The threshold where we switch from quicksort to merge sort is a
configurable parameter.

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)