On 7/14/13 5:28 PM, james wrote:
> Some random seeks during sync can't be helped, but if they are done when
> we aren't waiting for sync completion then they are in effect free.

That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write-heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the disks too. The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache.

And once the write cache fills, every I/O operation is competing. Nothing happens for free. You're stealing I/O from something else any time you force a write out. The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint. That waste is easy to measure--it shows up as more writes/transaction in pg_stat_bgwriter--and it hurts throughput more than any reduction in seek overhead you might otherwise get from early writes. The big gain isn't chasing after cheap seeks. The best path is the one that decreases the total volume of writes.
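
To make the write-volume argument concrete, here is a toy model, not PostgreSQL code. It assumes a made-up skewed workload where a few block numbers take most of the updates, and it ignores seeks, WAL and the OS cache entirely. The names checkpoint_writes, skewed_updates and flush_every are invented for this sketch. It only counts how many block writes one checkpoint interval costs when dirty blocks are held until the end versus flushed early at regular intervals.

```python
import random


def checkpoint_writes(updates, flush_every=None):
    """Count block writes for one checkpoint interval (toy model).

    updates: block numbers dirtied, in order.
    flush_every: None means every dirty block is written once, at the end.
    Otherwise all dirty blocks are also flushed after every
    `flush_every` updates, which models writing early.
    """
    dirty = set()
    writes = 0
    for i, block in enumerate(updates, start=1):
        dirty.add(block)
        if flush_every is not None and i % flush_every == 0:
            writes += len(dirty)  # early flush of everything dirty so far
            dirty.clear()
    writes += len(dirty)  # final flush at checkpoint time
    return writes


def skewed_updates(n_updates, n_blocks, seed=0):
    """Hypothetical workload: low block numbers are far more popular."""
    rng = random.Random(seed)
    return [
        min(int(rng.paretovariate(1.0)) - 1, n_blocks - 1)
        for _ in range(n_updates)
    ]


if __name__ == "__main__":
    updates = skewed_updates(10_000, 1_000)
    late = checkpoint_writes(updates)
    early = checkpoint_writes(updates, flush_every=100)
    print("writes when delayed to checkpoint time:", late)
    print("writes when flushed every 100 updates: ", early)
```

In this model the delayed policy can never write more than the early one, since each block is written at most once per interval. How large the gap is on a real system depends on how skewed the workload actually is, which is what pg_buffercache and pg_stat_bgwriter will tell you.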

We played this game with the background writer work for 8.3. The main reason the committed version improved on the original design is that it completely eliminated doing work on popular buffers in advance. Everything happens at the last possible time, which is the optimal throughput situation. The 8.1/8.2 BGW used to try to write things out before they were strictly necessary, in hopes that the I/O would be free. But it rarely was, while there was always a cost to forcing them to disk early. And that cost is highest when you're talking about the higher usage blocks the checkpointer tends to write. When in doubt, always delay the write in hopes it will be written to again and you'll save work.

> So it occurs to me that perhaps we can watch for patterns where we have
> groups of adjacent writes that might stream, and when they form we might
> schedule them...

Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening. We can't even easily make it know, not without adding a whole new source of IPC and locking contention among clients.

Whatever scheduling decision the checkpointer might make with its limited knowledge of system I/O is going to be poor. You might find a 100% write benchmark that it helps, but those are not representative of the real world. In any mixed read/write case, the operating system is likely to do better. That's why things like sorting blocks sometimes seem to help someone, somewhere, with one workload, but then aren't repeatable.

We can decide to trade throughput for latency by nudging the OS to deal with its queued writes more regularly. That will result in more total writes, which is the reason throughput drops.

But the idea that PostgreSQL is going to do a better global job of I/O scheduling, that road is a hard one to walk. It's only going to happen if we pull all of the I/O into the database *and* do a better job on the entire process than the existing OS kernel does. That sort of dream, of outperforming the filesystem, it is very difficult to realize. There's a good reason that companies like Oracle stopped pushing so hard on recommending raw partitions.

Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
