On Wed, 20 Dec 2006, Inaam Rana wrote:

Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers.

This is actually a question I'd been meaning to throw out myself to this list. How hard would it be to add an internal counter to the buffer management scheme that kept track of the current number of dirty pages? I've been looking at the bufmgr code lately trying to figure out how to insert one as part of building an auto-tuning bgwriter, but it's unclear to me how I'd lock such a resource properly and scalably. I have a feeling I'd be inserting a single-process locking bottleneck into that code with any of the naive implementations I considered.
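
To make it concrete, here's the rough shape of what I've been sketching, as standalone C rather than real bufmgr code; NDirtyBuffers and the function names are invented, and it leans on GCC's atomic builtins rather than a shared lock precisely so that every backend dirtying a buffer doesn't serialize on one spinlock:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical counter; in the real thing it would live in shared memory */
    static volatile int64_t NDirtyBuffers = 0;

    /* Call when a clean buffer first becomes dirty */
    static void
    CountBufferDirtied(void)
    {
        __sync_fetch_and_add(&NDirtyBuffers, 1);
    }

    /* Call after a dirty buffer has been written out and marked clean */
    static void
    CountBufferCleaned(void)
    {
        __sync_fetch_and_sub(&NDirtyBuffers, 1);
    }

    int
    main(void)
    {
        CountBufferDirtied();
        CountBufferDirtied();
        CountBufferCleaned();

        /* The bgwriter would read this without taking any lock; a slightly
         * stale value is fine for deciding how much work to do each round */
        printf("dirty buffers: %lld\n", (long long) NDirtyBuffers);
        return 0;
    }

Whether something like that stays cheap under heavy contention is exactly the part I don't trust my gut on, which is why I'd rather hear from someone who knows the buffer header locking better.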

The main problem I've been seeing is also long waits stuck behind a slow fsync on Linux. What I've been moving toward testing is an approach slightly different from the proposals here. What if all the database page writes (background writer, buffer eviction, or checkpoint scan) were counted, and periodic fsync requests were sent to the bgwriter based on that? For example, when I know I have a battery-backed caching controller that will buffer 64MB worth of data for me, if I forced an fsync after every 6000 8K writes (roughly 48MB), no single fsync would get stuck waiting for the disk for longer than I'd like.
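
As a toy illustration of the accounting I mean (standalone C with made-up names, nothing lifted from the backend; 6000 8K pages is that roughly-48MB figure, comfortably under the 64MB cache):

    #include <unistd.h>

    #define WRITES_BEFORE_SYNC 6000     /* ~48MB of 8K pages, under the 64MB cache */

    static int writes_since_sync = 0;   /* hypothetical counter, per file or global */

    /* Wrap every page write; once enough pages have gone out, push them to
     * disk so no later fsync has a mountain of dirty data queued behind it */
    static ssize_t
    write_page_counted(int fd, const void *page, size_t len)
    {
        ssize_t rc = write(fd, page, len);

        if (rc >= 0 && ++writes_since_sync >= WRITES_BEFORE_SYNC)
        {
            fsync(fd);                  /* small, bounded sync instead of one huge one */
            writes_since_sync = 0;
        }
        return rc;
    }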

Give the admin a max_writes_before_sync parameter, make a default of 0 behave exactly like today, and off you go: a simple tunable that doesn't require a complicated scheme to implement or break anybody's existing setup. Combined with a properly tuned background writer, that would solve the issues I've been running into. It would even make the problem of Linux accumulating too many cached writes until checkpoint time go away (I know how to eliminate that by adjusting the kernel's caching policy, but I have to be root to do it; a DBA should be able to work around that issue even if they don't have access to the kernel tunables).
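
In postgresql.conf terms I'm picturing nothing fancier than this (the parameter name is of course just a placeholder):

    # 0 (the default) keeps today's behavior: no extra fsync requests at all
    max_writes_before_sync = 0

    # With a 64MB battery-backed controller cache, sync every ~48MB of pages
    #max_writes_before_sync = 6000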

While I'm all for testing to prove me wrong, my gut feel is that going all the way to sync writes a la Oracle is a doomed approach, particularly on low-end hardware where they're super expensive. Following The Oracle Way is a good roadmap for a lot of things, but I wouldn't put building a lean enough database to run on modest hardware on that list. You can do sync writes with perfectly good performance on systems with a good battery-backed cache, but I think you'll get creamed in comparisons against MySQL on IDE disks if you start walking down that path; since right now a fair comparison with similar logging behavior is an even match there, that's a step backwards.

Also on the topic of sync writes to the database proper: wouldn't using O_DIRECT for those potentially be counter-productive? I was under the impression that one of the behaviors Postgres counts on is that data evicted from its buffer cache, eventually intended for writing to disk, is still kept around for a bit in the OS buffer cache. A subsequent read, because the data was needed again, might find it already in the OS buffer and therefore avoid an actual disk read; that substantially reduces the typical penalty for the database engine making a bad choice about what to evict. I fear a move to direct writes would put more pressure on the LRU implementation to be very smart, and that's code you really don't want to make more complicated.
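
Just to spell out what direct I/O asks of the caller (a standalone Linux example, nothing to do with the backend's actual I/O paths): with O_DIRECT the kernel keeps no copy of the page, and on Linux you also have to hand it aligned buffers, so an evicted page written this way really is gone from memory and the next read of it is a genuine disk read.

    #define _GNU_SOURCE             /* O_DIRECT is a GNU/Linux extension */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        void   *page = NULL;
        int     fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);

        /* O_DIRECT requires the buffer address, length, and file offset to
         * be suitably aligned; one 8K page aligned to 8K is safe */
        if (fd < 0 || posix_memalign(&page, 8192, 8192) != 0)
            return 1;
        memset(page, 0, 8192);

        /* Goes straight to the device, bypassing the OS buffer cache, so a
         * later read of this page cannot be satisfied from kernel memory */
        write(fd, page, 8192);

        close(fd);
        free(page);
        return 0;
    }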

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
