Re: [HACKERS] Redesigning checkpoint_segments

Greg Smith Thu, 06 Jun 2013 19:07:40 -0700

On 6/6/13 4:42 AM, Joshua D. Drake wrote:

> On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:
>
>> (I'm sure you know this, but:) If you perform a checkpoint as fast and
>> short as possible, the sudden burst of writes and fsyncs will
>> overwhelm the I/O subsystem, and slow down queries. That's what we saw
>> before spread checkpoints: when a checkpoint happens, the response
>> times of queries jumped up.
>
> That isn't quite right. Previously we had lock issues as well and
> checkpoints would take considerable time to complete. What I am talking
> about is that the background writer (and wal writer where applicable)
> have done all the work before a checkpoint is even called.

That is not possible, and if you look deeper at a lot of workloads you'll eventually see why. I'd recommend grabbing snapshots of pg_buffercache output from a lot of different types of servers and seeing what the usage count distribution looks like. That's what I did in order to create all of the behaviors the current background writer code caters to. Attached is a small spreadsheet that shows the two main extremes here, from one of my old talks. "Effective buffer cache system" is full of usage count 5 pages, while the "Minimally effective buffer cache" one is all usage count 1 or 0. We don't have redundant systems here; we have two that aim at distinctly different workloads. That's one reason why splitting them apart ended up being necessary to move forward: they really don't overlap very much on some servers.
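
For anyone who wants to repeat that kind of survey, a minimal sketch follows. The query relies only on the usagecount and isdirty columns of the pg_buffercache view; the Python helper and the sample rows are hypothetical and are meant only to show how the per-count shares might be compared between servers.

```python
# Query to run on a server that has the pg_buffercache extension installed.
USAGE_QUERY = """
SELECT usagecount,
       count(*) AS buffers,
       sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;
"""


def usage_distribution(rows):
    """Turn (usagecount, buffers, dirty) rows into {usagecount: share of all buffers}.

    usagecount may be None for buffers that are currently unused.
    """
    rows = list(rows)
    total = sum(buffers for _, buffers, _ in rows)
    if total == 0:
        return {}
    return {usage: buffers / total for usage, buffers, _ in rows}


# Hypothetical snapshot shaped like the "hot data set" extreme: mostly usage count 5.
hot_rows = [(0, 50, 0), (1, 50, 10), (5, 900, 600)]
hot_share = usage_distribution(hot_rows)
print(hot_share)
```

Taking the same snapshot several times over a day, rather than once, gives a better picture of how stable the distribution is.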

Sampling a few servers that way was where the controversial idea of scanning the whole buffer pool every few minutes even without activity came from too. I found a bursty real-world workload where that was necessary to keep buffers usefully clean, and that heuristic helped them a lot. I too would like to revisit the exact logic used, but I could cook up a test case where it's useful again if people really doubt it has any value. There's one in the 2007 archives somewhere.

The reason the checkpointer code has to do this work, and it has to spread the writes out, is that on some systems the hot data set hits a high usage count. If shared_buffers is 8GB and at any moment 6GB of it has a usage count of 5, which absolutely happens on many busy servers, the background writer will do almost nothing useful. It won't and shouldn't touch buffers unless their usage count is low. Those heavily referenced blocks will only be written to disk once per checkpoint cycle.

Without the spreading, in this example you will drop 6GB into "Dirty Memory" on a Linux server, call fdatasync, and the server might stop doing any work at all for *minutes* of time. Easiest way to see it happen is to set checkpoint_completion_target to 0, put the filesystem on ext3, and have a server with lots of RAM. I have a monitoring tool that graphs Dirty Memory over time because this problem is so nasty even with the spreading code in place.
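
To put rough numbers on what the spreading buys in the 6GB example, here is a back-of-the-envelope sketch. It paces purely against time, which is a simplification (the real checkpointer also tracks WAL consumption), and the 300 second checkpoint_timeout and 0.5 checkpoint_completion_target are assumed values chosen for illustration. The function name is hypothetical.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def checkpoint_write_rate(dirty_bytes, checkpoint_timeout_s, completion_target):
    """Average write rate in bytes/s if checkpoint writes are spread evenly.

    The writes are assumed to be spread over checkpoint_timeout_s * completion_target.
    A target of 0 means no spreading at all, modeled here as an unbounded rate.
    """
    window_s = checkpoint_timeout_s * completion_target
    if window_s <= 0:
        return float("inf")
    return dirty_bytes / window_s


# 6GB of dirty, high usage count buffers; assumed 300s timeout and 0.5 target.
spread_rate = checkpoint_write_rate(6 * GB, 300, 0.5)
burst_rate = checkpoint_write_rate(6 * GB, 300, 0.0)
print(f"spread: {spread_rate / MB:.2f} MB/s, unspread: {burst_rate}")
```

Under these assumptions the spread case needs a steady rate of roughly 41 MB/s, while the unspread case hands the whole 6GB to the OS at once and leaves the sync to deal with it.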

There is this idea that pops up sometimes that a background writer write is better than a checkpoint one. This is backwards. A dirty block must be written at least once per checkpoint. If you only write it once per checkpoint, inside of the checkpoint process, that is the ideal. It's what you want for best performance when it's possible.
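
The write-count argument can be shown with a toy model. This is a sketch under simplifying assumptions, not how the background writer actually chooses buffers: it follows a single hot page through one checkpoint cycle and imagines a hypothetical cleaner that flushes the page after every modification, compared with leaving it for the checkpoint.

```python
def writes_per_cycle(modifications, eager_cleaner):
    """Count physical writes of one page during one checkpoint cycle.

    modifications: how many times backends dirty the page in the cycle.
    eager_cleaner: if True, a (hypothetical) cleaning pass flushes the page
    after each modification; if False, only the checkpoint writes it.
    """
    writes = 0
    dirty = False
    for _ in range(modifications):
        dirty = True              # a backend modifies the page
        if eager_cleaner:
            writes += 1           # flushed early ...
            dirty = False         # ... only to be dirtied again next time
    if dirty:
        writes += 1               # the checkpoint must write whatever is still dirty
    return writes


checkpoint_only = writes_per_cycle(100, eager_cleaner=False)
with_eager_cleaner = writes_per_cycle(100, eager_cleaner=True)
print(checkpoint_only, with_eager_cleaner)
```

In this model every early write of a page that gets dirtied again is wasted I/O, which is why a single write inside the checkpoint is the ideal for hot pages.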

At the same time, some workloads churn through a lot of low usage count data, rather than building up a large block of high usage count stuff. On those your best hope for low latency is to crank up the background writer and let it try to stay ahead of backends with the writes. The checkpointer won't have nearly as much work to do in that situation.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

bgwriter-snapshot.xls
Description: MS-Excel spreadsheet
