On 25.06.2013 23:03, Robert Haas wrote:
On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
<hlinnakan...@vmware.com>  wrote:
I'm not sure it's a good idea to sleep proportionally to the time it took to
complete the previous fsync. If you have a 1GB cache in the RAID controller,
fsyncing a 1GB segment will fill it up. But since it fits in cache, it
will return immediately. So we proceed fsyncing other files, until the cache
is full and the fsync blocks. But once we fill up the cache, it's likely
that we're hurting concurrent queries. ISTM it would be better to stay under
that threshold, keeping the I/O system busy, but never fill up the cache
completely.

Isn't the behavior implemented by the patch a reasonable approximation
of just that?  When the fsyncs start to get slow, that's when we start
to sleep.   I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that.

Well, that's the point I was trying to make: you should sleep *before* the fsyncs get slow.

The only feedback we have on how bad things are is how long it took
the last fsync to complete, so I actually think that's a much better
way to go than any fixed sleep - which will often be unnecessarily
long on a well-behaved system, and which will often be far too short
on one that's having trouble. I'm inclined to think Kondo-san
has got it right.
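
To make the feedback idea above concrete, here is a minimal sketch of "sleep in proportion to how long the last fsync took". It is not Kondo-san's patch: the function names, the `factor` multiplier and the `max_sleep` cap are all assumptions made up for illustration.

```python
import os
import time


def proportional_sleep(last_fsync_seconds, factor=1.0, max_sleep=10.0):
    """Seconds to sleep after an fsync that took last_fsync_seconds.

    factor and max_sleep are hypothetical tuning knobs, not real settings.
    """
    return min(max(last_fsync_seconds, 0.0) * factor, max_sleep)


def fsync_files(paths, factor=1.0, max_sleep=10.0, sleep=time.sleep):
    """fsync each file in turn, pausing after each one based on its duration."""
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            start = time.monotonic()
            os.fsync(fd)
            elapsed = time.monotonic() - start
        finally:
            os.close(fd)
        # A fast fsync (data fit in cache) gives almost no pause;
        # a slow one (cache already full) gives a long pause.
        sleep(proportional_sleep(elapsed, factor, max_sleep))
```

Note that the pause only grows after an fsync has already been slow, which is exactly the weakness discussed in this thread.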

Quite possible, I really don't know. I'm inclined to first try the simplest thing possible, and only make it more complicated if that's not good enough. Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep between every fsync, unless you're behind the schedule, is even simpler. In particular, it's easier to tie that into the checkpoint scheduler - I'm not sure how you'd measure progress or determine how long to sleep unless you assume that every fsync is the same.
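
As a rough sketch of the simpler alternative (a fixed sleep between fsyncs, skipped when behind schedule), something like the following would do. The names, the 0.1 s default and the idea of a time budget for the fsync phase are assumptions for illustration, and progress is measured by file count, i.e. it assumes every fsync costs the same.

```python
def sleep_before_next_fsync(files_done, files_total, elapsed, phase_budget,
                            fixed_sleep=0.1):
    """Seconds to sleep after an fsync, given progress through the fsync phase.

    files_done / files_total: files fsynced so far and in total.
    elapsed / phase_budget: seconds spent so far and seconds allotted
    to the fsync phase (a hypothetical scheduler input).
    """
    if files_total <= 0 or phase_budget <= 0:
        return 0.0
    progress = files_done / files_total   # assumes all fsyncs are equal
    time_used = elapsed / phase_budget
    if progress < time_used:
        return 0.0                        # behind schedule: don't sleep
    return fixed_sleep                    # on or ahead of schedule
```

For example, with 5 of 10 files done after 10 s of a 30 s budget we are ahead and sleep the fixed amount; with 1 of 10 done after 15 s we are behind and skip the sleep.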

I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small.  When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big.  But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase.  I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments.  Or just
assume the fsync phase will take 30 seconds.
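
The "something dumb" rule above can be written down in a few lines. This is only a sketch of the heuristic as stated in the paragraph; the function name and the `default_fraction` parameter are made up for illustration.

```python
def fsync_phase_fraction(checkpoint_segments, default_fraction=0.10):
    """Guess what fraction of the checkpoint the fsync phase will take.

    Uses 10% by default, but 1/checkpoint_segments when
    checkpoint_segments < 10, since the OS is then likely to defer
    almost all writeback to the fsync phase.
    """
    if checkpoint_segments < 1:
        raise ValueError("checkpoint_segments must be at least 1")
    if checkpoint_segments < 10:
        return 1.0 / checkpoint_segments
    return default_fraction
```

With these numbers the two branches meet smoothly: 9 segments gives about 11%, and 10 or more gives 10%.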

If checkpoint_segments < 10, there isn't very much dirty data to flush out. This isn't really a problem in that case: no matter how stupidly we do the writing and fsyncing, the I/O cache can absorb it. It doesn't really matter what we do in that case.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
