On 7/16/13 12:46 PM, Ants Aasma wrote:

Spread checkpoints sprinkles the writes out over a long
period and the general tuning advice is to heavily bound the amount of
memory the OS is willing to keep dirty.

That's arguing that you can make this feature useful if you tune in a particular way. That's interesting, but the goal here isn't to prove the existence of some workload that a change is useful for. You can usually find a test case that validates any performance patch as helpful if you search for one. Everyone who has submitted a sorted checkpoint patch, for example, has found some setup where it shows significant gains. We're trying to keep performance stable across a much wider set of possibilities though.

Let's talk about default parameters instead, which quickly demonstrates where your assumptions fail. The server I happen to be running pgbench tests on today has 72GB of RAM running SL6 with Red Hat-derived kernel 2.6.32-358.11.1. This is a very popular middle grade server configuration nowadays. There, dirty_background_ratio is 10 (percent). That means that roughly 7GB of RAM can be used for write caching. Note that this is a fairly low write cache tuning compared to a survey of systems in the field--lots of people have servers with earlier kernels where these numbers can be as high as 20 or even 40% instead.

The current feasible tuning for shared_buffers suggests a value of 8GB is near the upper limit, beyond which cache-related overhead makes increases counterproductive. Your examples are showing 53% of shared_buffers dirty at checkpoint time; that's typical. The checkpointer is then writing out just over 4GB of data.
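To make the arithmetic behind those two figures explicit, here is a rough sketch in Python. It assumes the ratio applies to total RAM, which is a simplification (the kernel computes the limit against dirtyable memory, not all of RAM), and the function names are mine, not anything from the kernel or Postgres.

```python
GIB = 1024 ** 3

def dirty_limit_bytes(ram_bytes, ratio_percent):
    """Approximate write cache allowed by a vm.dirty_*_ratio style setting.

    Simplification: applies the percentage to total RAM rather than to
    the kernel's notion of dirtyable memory.
    """
    return ram_bytes * ratio_percent // 100

def checkpoint_write_bytes(shared_buffers_bytes, dirty_fraction):
    """Data the checkpointer writes if this fraction of buffers is dirty."""
    return int(shared_buffers_bytes * dirty_fraction)

# Figures from the example server above.
os_cache = dirty_limit_bytes(72 * GIB, 10)
ckpt = checkpoint_write_bytes(8 * GIB, 0.53)

print(f"OS write cache:    {os_cache / GIB:.1f} GiB")
print(f"checkpoint writes: {ckpt / GIB:.2f} GiB")
```

On these inputs the OS-side figure comes out well above the checkpoint-side one, which is the comparison that matters for the rest of this message.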

With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize. That potentially includes backend, background writer, checkpoint, temp table, statistics, log, and WAL data. The scheduler is also considering read operations.

-The checkpointer process has 4GB of writes from rarely written shared memory it's trying to optimize.

This is why, if you take the opposite of your approach today--go searching for workloads where sorting is counterproductive--those are equally easy to find. Any test of write speed I do starts with about 50 different scale/client combinations. Why do I suggest pgbench-tools as a way to do performance tests? It's because an automated sweep of client setups like it does is the minimum necessary to create enough variation in workload when changing the database's write path. It's really amazing how often doing that shows a proposed change is just shuffling the good and bad cases around. That's been the case for every sorting and fsync delay change submitted so far. I'm not even interested in testing today's submission because I tried that particular approach for a few months, twice so far, and it fell apart on just as many workloads as it helped.
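For anyone who hasn't run a sweep like that, the shape of it is roughly the following. This is only a sketch of the idea, not how pgbench-tools is implemented: the scale and client lists are hypothetical values picked so that 5 x 10 gives 50 combinations, and the database name and run length are placeholders. The pgbench options used (-i, -s, -c, -T) are the standard ones.

```python
from itertools import product

# Hypothetical values, chosen only so that 5 x 10 gives 50 combinations.
SCALES = [10, 100, 500, 1000, 2000]
CLIENTS = [1, 2, 4, 8, 16, 32, 64, 96, 128, 256]

def init_command(scale, dbname="pgbench"):
    """Command that rebuilds the test database at a given scale."""
    return ["pgbench", "-i", "-s", str(scale), dbname]

def run_command(clients, seconds=600, dbname="pgbench"):
    """Command for one timed run with a given client count."""
    return ["pgbench", "-c", str(clients), "-T", str(seconds), dbname]

def sweep(scales=SCALES, clients=CLIENTS):
    """Every (scale, clients, command) combination, grouped by scale.

    The database has to be re-initialized with init_command() each time
    the scale changes; that step is left to the caller here.
    """
    return [(s, c, run_command(c)) for s, c in product(scales, clients)]

plan = sweep()
print(len(plan), "scale/client combinations")
```

Each command list can be handed to subprocess.run() as is. The point is that the results need to be compared across the whole grid, not at one cell of it.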

The checkpointer has the best long term overview of the situation here, OS
scheduling only has the short term view of outstanding read and write

True only if shared_buffers is large compared to the OS write cache, which was not the case on the example I generated with all of a minute's work. I regularly see servers where Linux's "Dirty" area becomes a multiple of the dirty buffers written by a checkpoint. I can usually make that happen at will with CLUSTER and VACUUM on big tables. The idea that the checkpointer has a long-term view while the OS has a short one presumes a setup that I would say is possible but not common.
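If you want to check this on your own servers, something along these lines would do it. It's a minimal sketch: the helper names are made up for illustration, the sample text below is invented rather than captured from a real system, and the checkpoint size is whatever figure you supply from your own measurements.

```python
def meminfo_kb(text, field):
    """Return the value in kB of one field from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)

def dirty_vs_checkpoint(meminfo_text, checkpoint_bytes):
    """How many times larger the OS "Dirty" area is than one checkpoint."""
    dirty_bytes = meminfo_kb(meminfo_text, "Dirty") * 1024
    return dirty_bytes / checkpoint_bytes

# Invented sample in the /proc/meminfo layout; on a Linux server you would
# read the real file with open("/proc/meminfo").read() instead.
sample = """MemTotal:       75497472 kB
Dirty:           8388608 kB
Writeback:             0 kB
"""

ratio = dirty_vs_checkpoint(sample, 4 * 1024 ** 3)
print(f"Dirty is {ratio:.1f}x the checkpoint write volume")
```

Sampling that during a CLUSTER or a VACUUM of a big table is where I'd expect the ratio to climb.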

kernel settings: dirty_background_bytes = 32M,
dirty_bytes = 128M.

You disclaimed this as a best case scenario. It is a low throughput / low latency tuning. That's fine, but if Postgres optimizes itself toward those cases it runs the risk of high throughput servers with large caches being detuned. I've posted examples before showing very low write caches like this leading to VACUUM running at 1/2 its normal speed or worse, as a simple example of where a positive change in one area can backfire badly on another workload. That particular problem was so common I updated pgbench-tools recently to track table maintenance time between tests, because that demonstrated an issue even when the TPS numbers all looked fine.

Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)