I've updated http://highperfpostgres.com/results-write-9.2-cf4/index.htm with more data, including two alternate background writer configurations. Since the sensitive part of the original results was scales of 500 and 1000, I've also gone back and added scale=750 runs to all results. Quick summary: I'm not worried about 9.2 performance now. I'm increasingly confident that the earlier problems I reported on are just bad interactions between the reinvigorated background writer and workloads that are tough to write to disk. I'm satisfied I understand these test results well enough to start evaluating the pending 9.2 changes in the CF queue I wanted to benchmark.

Attached are now useful client and scale graphs. All of 9.0, 9.1, and 9.2 have now been run with exactly the same scales and client loads, so the graphs of all three versions can be compared. The two 9.2 variations with alternate parameters were only run at some scales, which means you can't compare them usefully on the clients graph, only on the scaling one. They are very obviously in a whole different range of the clients graph; just ignore the two lines that are way below the rest.
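
To be concrete about what each data point represents: each one is a pgbench run along these lines. This is a sketch rather than the exact pgbench-tools invocation, with the 10 minute duration and the scale/client values filled in from the numbers quoted below; the thread count here is an arbitrary choice for the example:

    # build a database at one of the tested scales (500, 750, 1000)
    pgbench -i -s 750 pgbench
    # standard TPC-B-like test, 10 minutes at one client count
    pgbench -c 16 -j 4 -T 600 pgbench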

Here's a repeat of the interesting parts of the data set with the new points. Here "9.2N" is with no background writer at all, while "9.2H" has the background writer set to half strength: bgwriter_lru_maxpages = 50. I picked one middle client level out of the scale=750 results just to focus better; the relative results are not sensitive to that:

scale=500, db is 46% of RAM
Version Avg TPS
9.0  1961
9.1  2255
9.2  2525
9.2N 2267
9.2H 2300

scale=750, db is 69% of RAM; clients=16
Version Avg TPS
9.0  1298
9.1  1387
9.2  1477
9.2N 1489
9.2H 943

scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
9.2N 516
9.2H 400
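
For reference, the postgresql.conf difference behind those two variants looks roughly like this. This is a sketch rather than a copy of the actual config files; it leans on the stock default of bgwriter_lru_maxpages = 100 and the fact that setting it to 0 disables the background writer:

    bgwriter_lru_maxpages = 100   # 9.2: stock background writer
    bgwriter_lru_maxpages = 50    # 9.2H: half strength
    bgwriter_lru_maxpages = 0     # 9.2N: background writer off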

The fact that almost all of the performance regression in 9.2 goes away if the background writer is disabled is an interesting point. That results actually get worse at scale=500 without the background writer is another. That pair of observations makes me feel better that there's a tuning trade-off being implicitly made by having a more active background writer in 9.2; it helps in some cases and hurts in others. That I can deal with. Everything lines up perfectly at scale=500 if I reorder on TPS:

scale=500, db is 46% of RAM
Version Avg TPS
9.2  2525
9.2H 2300
9.2N 2267
9.1  2255
9.0  1961

That makes you want to say "the more background writer the better", right?

The new scale=750 numbers are weird though, and they keep this from being so clear. I ran the parts that were most weird twice just because they seemed so odd, and the results were repeatable. Unlike at scale=500, at scale=750 the 9.2/no background writer configuration has the best performance of any run. But the half-intensity one has the worst! It would be nice if it fell between the 9.2 and 9.2N results; instead it's at the other edge.

The only lesson I can think to draw here is that once we're in the area where performance is dominated by the trivia around exactly how writes are scheduled, the optimal ordering of writes is just too complicated to model that easily. The rest of this is all speculation on how to fit some ideas to this data.

Going back to 8.3 development, one of the design trade-offs I was very concerned about was not wasting resources by having the BGW run too often. Even then it was clear that for these simple pgbench tests, there were situations where letting backends do their own writes was better than speculative writes from the background writer. The BGW constantly risks writing a page that will be re-dirtied before it goes to disk. That can't be too common in the current design though, since it avoids pages with high usage counts. (The original BGW wrote things that had been used recently, and that was a measurable problem by 8.3.)

I think an even bigger factor now is that BGW writes can disturb the write ordering/combining done at the kernel and storage levels. It's painfully obvious now how much PostgreSQL relies on that to get good performance. All sorts of things break badly if we aren't getting random writes scheduled to optimize seek times, in as many contexts as possible. It doesn't seem unreasonable that background writer writes can introduce some delay into the checkpoint writes, just by adding more random components to what is already a difficult-to-handle write/sync series. That's what I think these results are showing: background writer writes can deoptimize other forms of write.

A second fact that's visible from the TPS graphs over the test run, and obvious if you think about it, is that BGW writes force data to physical disk earlier than it might otherwise go there. That's a subtle pattern in the graphs. I expect that though, given that one element of "do I write this?" in Linux is how old the write is. Wondering about this really emphasizes that I need to either add graphing of vmstat/iostat data to these graphs or switch to a benchmark that does that already. I think I've got just enough people using pgbench-tools to justify the feature, even if I plan to use the program less.
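
Until that's in pgbench-tools, something like this shell sketch would capture the OS-level data I want to line up against the TPS graphs; one second samples for the length of a 10 minute run, with the pgbench line being the same hypothetical invocation sketched above:

    vmstat 1 600 > vmstat.log &
    iostat -x 1 600 > iostat.log &
    pgbench -c 16 -j 4 -T 600 pgbench
    wait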

I also have a good answer to "why does this only happen at these scales?" now. At scales below here, the database is so small relative to RAM that it just all fits all the time. That includes the indexes being very small, so not many writes are generated by their dirty blocks. At higher scales, the write volume becomes seek bound, and throughput drops so low that checkpoints become timeout based rather than triggered by filling checkpoint_segments, so there are significantly fewer of them. At the largest scales and client counts here, there isn't a single checkpoint that actually finishes during some of these 10 minute long runs. One doesn't even start until 5 minutes have gone by, and the checkpoint writes are so slow they take longer than 5 minutes to trickle out and sync, with all the competing I/O from backends mixed in. Note that the "clients-sets" graph still shows a strong jump from 9.0 to 9.1 at high client counts; I'm pretty sure that's the fsync compaction at work.
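
The checkpoint arithmetic behind that, assuming checkpoint_timeout is at its 5 minute default (which matches the "doesn't even start until 5 minutes have gone by" observation) and checkpoint_completion_target at its 0.5 default; I'm assuming stock settings here, which may not match the actual test configuration:

    checkpoint_timeout = 5min            # first timed checkpoint at t=300s of a 600s run
    checkpoint_completion_target = 0.5   # writes nominally spread over ~150s, then sync
    # A checkpoint starting at t=300s has to write and sync everything in
    # under 300s to finish within the run; under this load it doesn't.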

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

<<attachment: scaling-sets.png>>

<<attachment: clients-sets.png>>
