On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think Kondo-san
>> has got it right.
>
> Quite possible, I really don't know. I'm inclined to first try the simplest
> thing possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
> between every fsync, unless you're behind the schedule, is even simpler.

I'm pretty sure Greg Smith tried the fixed-sleep thing before and it
didn't work that well. I have also tried it and the resulting behavior
was unimpressive. It makes checkpoints take a long time to complete
even when there's very little data to flush out to the OS, which is
annoying; and when things actually do get ugly, the sleeps aren't long
enough to matter. See the timings Kondo-san posted downthread: 100 ms
delays aren't going to let the system recover in any useful way when
the fsync can take 13 s for one file. On a system that's badly weighed
down by I/O, the fsync times are often *extremely* long - 13 s is far
from the worst you can see. You have to give the system a meaningful
time to recover from that, allowing other processes to make meaningful
progress before you hit it again, or system performance just goes down
the tubes. Greg's test, IIRC, used 3 s sleeps rather than your
proposal of 100 ms, but it still wasn't enough.

> In
> particular, it's easier to tie that into the checkpoint scheduler - I'm not
> sure how you'd measure progress or determine how long to sleep unless you
> assume that every fsync is the same.

I think the thing to do is assume that the fsync phase will take 10%
or so of the total checkpoint time, but then be prepared to let the
checkpoint run a bit longer if the fsyncs end up being slow. As Greg
has pointed out during prior discussions of this, the normal scenario
when things get bad here is that there is no way in hell you're going
to fit the checkpoint into the originally planned time. Once all of
the write caches between PostgreSQL and the spinning rust are full,
the system is in trouble and things are going to suck. The hope is
that we can stop beating the horse while it is merely in intensive
care rather than continuing until the corpse is fully skeletonized.
Fixed delays don't work because - to push an already-overdone metaphor
a bit further - we have no idea how much of a beating the horse can
take; we need something adaptive so that we respond to what actually
happens rather than making predictions that will almost certainly be
wrong a large fraction of the time.

To put this another way, when we start the fsync() phase, it often
consumes 100% of the available I/O on the machine, completely starving
every other process that might need any. This is certainly a
deficiency in the Linux I/O scheduler, but as they seem in no hurry to
fix it we'll have to cope with it as best we can. If you do the
fsyncs in fast succession (and 100 ms separation might as well be no
separation at all), then the I/O starvation of the entire system
persists through the entire fsync phase. If, on the other hand, you
sleep for the same amount of time the previous fsync took, then on
average, 50% of the machine's I/O capacity will be available for
all other system activity throughout the fsync phase, rather than 0%.

Now, unfortunately, this is still not that good, because it's often
the case that all of the fsyncs except one are reasonably fast, and
there's one monster one that is very slow. ext3 has a known bad
behavior that dumps all dirty data for the entire *filesystem* when
you fsync, which tends to create these kinds of effects. But even on
better-behaved filesystems, like ext4, it's fairly common to have one
fsync that is painfully longer than all the others. So even with this
patch, there are still going to be cases where the whole system
becomes unresponsive. I don't see any way to do better without a
better kernel API, or a better I/O scheduler, but that doesn't mean we
shouldn't do at least this much.
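
To make the 50% figure concrete, here is a minimal Python sketch of the
arithmetic. It is not code from Kondo-san's patch: the function name is
hypothetical, and the fsync durations are made-up sample values apart
from the 13 s case mentioned above. It also simplifies by counting a
sleep after the last fsync and by treating "no fsync in flight" as "I/O
available to everyone else".

```python
def io_available_fraction(fsync_times, sleep_for):
    """Fraction of the fsync phase during which no fsync is in flight.

    fsync_times: duration of each fsync in seconds (hypothetical sample data).
    sleep_for:   function mapping the duration of the fsync that just
                 finished to the sleep taken before the next one.
    """
    busy = sum(fsync_times)                         # time spent inside fsync()
    idle = sum(sleep_for(t) for t in fsync_times)   # time spent sleeping
    return idle / (busy + idle)


# Hypothetical durations: mostly quick fsyncs plus one 13 s monster.
fsync_times = [0.2, 0.1, 13.0, 0.3, 0.4]

# Fixed 100 ms sleep between fsyncs.
fixed = io_available_fraction(fsync_times, lambda t: 0.1)

# Adaptive: sleep for as long as the previous fsync took.
adaptive = io_available_fraction(fsync_times, lambda t: t)

print(f"fixed 100 ms sleep: {fixed:.1%} of the phase left for other I/O")
print(f"adaptive sleep:     {adaptive:.1%} of the phase left for other I/O")
```

With these sample numbers the fixed sleep leaves only a few percent of
the phase free, while the adaptive sleep leaves half of it. The average
hides the fact that the 13 s fsync still starves everything else for 13 s
straight.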

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company