I've updated http://highperfpostgres.com/results-write-9.2-cf4/index.htm with more data, including two alternate background writer configurations. Since the sensitive part of the original results was scales of 500 and 1000, I've also gone back and added scale=750 runs to all results. Quick summary: I'm not worried about 9.2 performance now. I'm increasingly confident that the earlier problems I reported on are just bad interactions between the reinvigorated background writer and workloads that are tough to write to disk. I'm satisfied I understand these test results well enough to start evaluating the pending 9.2 changes in the CF queue I wanted to benchmark.

Attached are now useful client and scale graphs. All of 9.0, 9.1, and 9.2 have now been run with exactly the same scales and client loads, so the graphs of all three versions can be compared. The two 9.2 variations with alternate parameters were only run at some scales, which means you can't compare them usefully on the clients graph, only on the scaling one. They are very obviously in a whole different range of the clients graph; just ignore the two lines that are way below the rest.
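
To be concrete about what each data point represents: each one is a pgbench run along these lines. This is a sketch rather than the exact pgbench-tools invocation, with the 10 minute duration and the scale/client values filled in from the numbers quoted below; the thread count here is an arbitrary choice for the example:

    # build a database at one of the tested scales (500, 750, 1000)
    pgbench -i -s 750 pgbench
    # standard TPC-B-like test, 10 minutes at one client count
    pgbench -c 16 -j 4 -T 600 pgbench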

Here's a repeat of the interesting parts of the data set with the new points. Here "9.2N" is with no background writer at all, while "9.2H" has the background writer set to half strength: bgwriter_lru_maxpages = 50. I picked one middle client level out of the scale=750 results just to focus better; the relative results are not sensitive to that:

scale=500, db is 46% of RAM
Version Avg TPS
9.0  1961
9.1  2255
9.2  2525
9.2N 2267
9.2H 2300

scale=750, db is 69% of RAM; clients=16
Version Avg TPS
9.0  1298
9.1  1387
9.2  1477
9.2N 1489
9.2H 943

scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
9.2N 516
9.2H 400
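
For reference, the postgresql.conf difference behind those two variants looks roughly like this. This is a sketch rather than a copy of the actual config files; it leans on the stock default of bgwriter_lru_maxpages = 100 and the fact that setting it to 0 disables the background writer:

    bgwriter_lru_maxpages = 100   # 9.2: stock background writer
    bgwriter_lru_maxpages = 50    # 9.2H: half strength
    bgwriter_lru_maxpages = 0     # 9.2N: background writer off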

The fact that almost all of the performance regression in 9.2 goes away if the background writer is disabled is an interesting point. That results actually get worse at scale=500 without the background writer is another. That pair of observations makes me feel better that there's a tuning trade-off being implicitly made by having a more active background writer in 9.2; it helps in some cases and hurts in others. That I can deal with. Everything lines up perfectly at scale=500 if I reorder on TPS:

scale=500, db is 46% of RAM
Version Avg TPS
9.2  2525
9.2H 2300
9.2N 2267
9.1  2255
9.0  1961

That makes you want to say "the more background writer the better", right?

The new scale=750 numbers are weird though, and they keep this from being so clear. I ran the parts that were most weird twice just because they seemed so odd, and the results were repeatable. Unlike at scale=500, at scale=750 the 9.2/no background writer configuration has the best performance of any run. But the half-intensity one has the worst! It would be nice if it fell between the 9.2 and 9.2N results; instead it's at the other edge.

The only lesson I can think to draw here is that once we're in the area where performance is dominated by the trivia around exactly how writes are scheduled, the optimal ordering of writes is just too complicated to model that easily. The rest of this is all speculation on how to fit some ideas to this data.

Going back to 8.3 development, one of the design trade-offs I was very concerned about was not wasting resources by having the BGW run too often. Even then it was clear that for these simple pgbench tests, there were situations where letting backends do their own writes was better than speculative writes from the background writer. The BGW constantly risks writing a page that will be re-dirtied before it goes to disk. That can't be too common in the current design though, since it avoids pages with high usage counts. (The original BGW wrote things that had been used recently, and that was a measurable problem by 8.3.)

I think an even bigger factor now is that BGW writes can disturb the write ordering/combining done at the kernel and storage levels. It's painfully obvious now how much PostgreSQL relies on that to get good performance. All sorts of things break badly if we aren't getting random writes scheduled to optimize seek times, in as many contexts as possible. It doesn't seem unreasonable that background writer writes can introduce some delay into the checkpoint writes, just by adding more random components to what is already a difficult-to-handle write/sync series. That's what I think these results are showing: background writer writes can deoptimize other forms of write.

A second fact that's visible from the TPS graphs over the test run, and obvious if you think about it, is that BGW writes force data to physical disk earlier than it might otherwise go there. That's a subtle pattern in the graphs. I expect that though, given that one element of "do I write this?" in Linux is how old the write is. Wondering about this really emphasizes that I need to either add graphing of vmstat/iostat data to these graphs or switch to a benchmark that does that already. I think I've got just enough people using pgbench-tools to justify the feature, even if I plan to use the program less.
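
Until that's in pgbench-tools, something like this shell sketch would capture the OS-level data I want to line up against the TPS graphs; one second samples for the length of a 10 minute run, with the pgbench line being the same hypothetical invocation sketched above:

    vmstat 1 600 > vmstat.log &
    iostat -x 1 600 > iostat.log &
    pgbench -c 16 -j 4 -T 600 pgbench
    wait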

I also have a good answer to "why does this only happen at these scales?" now. At scales below here, the database is so small relative to RAM that it just all fits all the time. That includes the indexes being very small, so not many writes are generated by their dirty blocks. At higher scales, the write volume becomes seek bound, and throughput drops so low that checkpoints become timeout based rather than triggered by filling checkpoint_segments, so there are significantly fewer of them. At the largest scales and client counts here, there isn't a single checkpoint that actually finishes during some of these 10 minute long runs. One doesn't even start until 5 minutes have gone by, and the checkpoint writes are so slow they take longer than 5 minutes to trickle out and sync, with all the competing I/O from backends mixed in. Note that the "clients-sets" graph still shows a strong jump from 9.0 to 9.1 at high client counts; I'm pretty sure that's the fsync compaction at work.
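
The checkpoint arithmetic behind that, assuming checkpoint_timeout is at its 5 minute default (which matches the "doesn't even start until 5 minutes have gone by" observation) and checkpoint_completion_target at its 0.5 default; I'm assuming stock settings here, which may not match the actual test configuration:

    checkpoint_timeout = 5min            # first timed checkpoint at t=300s of a 600s run
    checkpoint_completion_target = 0.5   # writes nominally spread over ~150s, then sync
    # A checkpoint starting at t=300s has to write and sync everything in
    # under 300s to finish within the run; under this load it doesn't.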

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

<<attachment: scaling-sets.png>>

<<attachment: clients-sets.png>>
