Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

Greg Smith Sun, 14 Jul 2013 15:47:58 -0700

On 7/14/13 5:28 PM, james wrote:

> Some random seeks during sync can't be helped, but if they are done when
> we aren't waiting for sync completion then they are in effect free.

That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the disks too. The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache.
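
One way to look at this is sketched below, as an illustration rather than a definitive recipe. The query string uses the usagecount and isdirty columns of the pg_buffercache view and is not executed here; the helper name dirty_share_by_usagecount and the sample rows are hypothetical. It reports what fraction of the dirty buffers sits at each usage count.

```python
from collections import Counter

# Query one might run against the pg_buffercache view (not executed here).
USAGE_QUERY = """
SELECT usagecount, isdirty, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount, isdirty
ORDER BY usagecount, isdirty
"""


def dirty_share_by_usagecount(rows):
    """rows: iterable of (usagecount, isdirty, buffers) tuples, the shape
    USAGE_QUERY would return. Returns {usagecount: share of dirty buffers}."""
    dirty = Counter()
    for usagecount, isdirty, buffers in rows:
        # Unused buffers come back with NULL (None) columns; skip them.
        if isdirty and usagecount is not None:
            dirty[usagecount] += buffers
    total = sum(dirty.values())
    if total == 0:
        return {}
    return {uc: n / total for uc, n in sorted(dirty.items())}


# Made-up rows purely to show the call; real numbers come from your server.
sample_rows = [(0, False, 100), (1, True, 10), (5, True, 30), (None, None, 50)]
print(dirty_share_by_usagecount(sample_rows))
```

If most of the dirty share lands on the highest usage counts between checkpoints, that matches the behavior described above.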

And once the write cache fills, every I/O operation is now competing. There is nothing happening for free. You're stealing I/O from something else any time you force a write out. The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint. And that easy to measure waste--it shows as more writes/transaction in pg_stat_bgwriter--it hurts throughput more than every reduction in seek overhead you might otherwise get from early writes. The big gain isn't chasing after cheap seeks. The best path is the one that decreases the total volume of writes.
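
A rough sketch of the writes/transaction figure follows. It assumes you take two snapshots of the counters yourself: buffers_checkpoint, buffers_clean, and buffers_backend as pg_stat_bgwriter exposed them around the time of this thread, plus xact_commit from pg_stat_database. Newer releases may have moved some of these counters, so check your version. The function name and the dict layout are hypothetical.

```python
def writes_per_transaction(before, after):
    """before/after: dict snapshots holding buffers_checkpoint,
    buffers_clean, buffers_backend and xact_commit counters."""
    write_keys = ("buffers_checkpoint", "buffers_clean", "buffers_backend")
    # Buffers written by anyone between the two snapshots.
    writes = sum(after[k] - before[k] for k in write_keys)
    # Transactions committed over the same window.
    xacts = after["xact_commit"] - before["xact_commit"]
    if xacts <= 0:
        raise ValueError("no committed transactions between snapshots")
    return writes / xacts
```

Comparing this ratio across two runs of the same workload is one way to see whether a change in write scheduling increased the total volume of writes.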

We played this game with the background writer work for 8.3. The main reason the one committed improved on the original design is that it completely eliminated doing work on popular buffers in advance. Everything happens at the last possible time, which is the optimal throughput situation. The 8.1/8.2 BGW used to try and write things out before they were strictly necessary, in hopes that that I/O would be free. But it rarely was, while there was always a cost to forcing them to disk early. And that cost is highest when you're talking about the higher usage blocks the checkpointer tends to write. When in doubt, always delay the write in hopes it will be written to again and you'll save work.
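
The cost of early writes on popular buffers can be illustrated with a toy model. This is a deliberately simplified sketch, not how the checkpointer or background writer is implemented: the access pattern, the names touched_buffer and count_writes, and the flush_every parameter are all assumptions made for illustration. Nine ticks out of ten dirty one of nine hot buffers, and every tenth tick dirties a cold buffer that is never touched again.

```python
HOT = 9  # number of hot buffers in this toy model


def touched_buffer(tick):
    slot = tick % 10
    if slot == 9:
        return HOT + tick // 10  # a fresh cold buffer, touched only once
    return slot                  # one of the hot buffers 0..8


def count_writes(ticks, flush_every=None):
    """Count buffer writes over one checkpoint interval of `ticks` ticks.
    flush_every=None delays every write to the end of the interval;
    otherwise all dirty buffers are written out every `flush_every` ticks."""
    dirty = set()
    writes = 0
    for tick in range(ticks):
        dirty.add(touched_buffer(tick))
        if flush_every and (tick + 1) % flush_every == 0:
            writes += len(dirty)  # early write of everything dirty so far
            dirty.clear()
    writes += len(dirty)  # final write of whatever is still dirty
    return writes


print(count_writes(1000))                   # delay everything: 109 writes
print(count_writes(1000, flush_every=100))  # write early: 190 writes
```

In this model the cold buffers cost the same either way (100 writes). The whole difference comes from the nine hot buffers, which are written once when delayed and ten times each when flushed early.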

> So it occurs to me that perhaps we can watch for patterns where we have
> groups of adjacent writes that might stream, and when they form we might
> schedule them...


Stop here.  I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening. We can't even easily make it know, not without adding a whole new source of IPC and locking contention among clients.

Whatever scheduling decision the checkpointer might make with its limited knowledge of system I/O is going to be poor. You might find a 100% write benchmark that it helps, but those are not representative of the real world. In any mixed read/write case, the operating system is likely to do better. That's why things like sorting blocks sometimes seem to help someone, somewhere, with one workload, but then aren't repeatable.

We can decide to trade throughput for latency by nudging the OS to deal with its queued writes more regularly. That will result in more total writes, which is the reason throughput drops.

But the idea that PostgreSQL is going to do a better global job of I/O scheduling, that road is a hard one to walk. It's only going to happen if we pull all of the I/O into the database *and* do a better job on the entire process than the existing OS kernel does. That sort of dream, of outperforming the filesystem, it is very difficult to realize. There's a good reason that companies like Oracle stopped pushing so hard on recommending raw partitions.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

