On 07/17/2012 06:56 PM, Tom Lane wrote:
> So I went to fix this in the obvious way (attached), but while testing it I found that the number of buffers_backend events reported during a regression test run barely changed; which surprised the heck out of me, so I dug deeper. The cause turns out to be extremely scary: ForwardFsyncRequest isn't getting called at all in the bgwriter process, because the bgwriter process has a pendingOpsTable.
When I did my testing early this year to look at checkpointer performance (among other 9.2 write changes like group commit), I did see some cases where buffers_backend was dramatically different on 9.2 vs. 9.1. There were plenty of cases where the totals across a 10 minute pgbench run were almost identical though, so this issue didn't stick out then. That's a very different workload than the regression tests though.
> This implies that nobody has done pull-the-plug testing on either HEAD or 9.2 since the checkpointer split went in (2011-11-01), because even a modicum of such testing would surely have shown that we're failing to fsync a significant fraction of our write traffic.
Ugh. Most of my pull-the-plug testing over the last six months has been focused on SSD tests with older versions. I want to duplicate this (and any potential fix) now that you've highlighted it.
> Furthermore, I would say that any performance testing done since then, if it wasn't looking at purely read-only scenarios, isn't worth the electrons it's written on. In particular, any performance gain that anybody might have attributed to the checkpointer splitup is very probably hogwash.
There hasn't been any performance testing that suggested the checkpointer splitup was justified. The stuff I did showed it being flat-out negative for a subset of pgbench-oriented cases, but those cases didn't seem real-world enough to disprove it as the right thing to do.
I thought there were two valid justifications for the checkpointer split (which is not a feature I have any corporate attachment to--I'm as isolated from how it was developed as you are). The first is that it seems like the right architecture to allow reworking checkpoints and background writes for future write path optimization. A good chunk of the time when I've tried to improve one of those (like my spread sync stuff from last year), the code was complicated by the background writer needing to follow the drum of checkpoint timing, and vice-versa. Being able to hack on those independently drew a sigh of relief from me. And while this adds some code duplication in things like the process setup, I thought the result would be cleaner for people reading the code to follow too. This problem is terrible, but I think part of how it crept in is that the single checkpoint+background writer process was doing so many things that it was hard to follow them all some days.
The second justification for the split was that it seems easier to get a low power result from, which I believe was the angle Peter Geoghegan was working when this popped up originally. The checkpointer has to run sometimes, but only at a 50% duty cycle as it's tuned out of the box. It seems nice to be able to approach that in a way that's power efficient without coupling it to whatever heartbeat the BGW is running at. I could even see people changing the frequencies for each independently depending on expected system load. Tune for lower power when you don't expect many users, that sort of thing.
-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com