On Wed, 20 Dec 2006, Inaam Rana wrote:

Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers.

This is actually a question I'd been meaning to throw out myself to this list. How hard would it be to add an internal counter to the buffer management scheme that kept track of the current number of dirty pages? I've been looking at the bufmgr code lately trying to figure out how to insert one as part of building an auto-tuning bgwriter, but it's unclear to me how I'd lock such a resource properly and scalably. I have a feeling I'd be inserting a single-process locking bottleneck into that code with any of the naive implementations I considered.
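
To make it concrete, here's the rough shape of what I've been sketching, as standalone C rather than real bufmgr code; NDirtyBuffers and the function names are invented, and it leans on GCC's atomic builtins rather than a shared lock precisely so that every backend dirtying a buffer doesn't serialize on one spinlock:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical counter; in the real thing it would live in shared memory */
    static volatile int64_t NDirtyBuffers = 0;

    /* Call when a clean buffer first becomes dirty */
    static void
    CountBufferDirtied(void)
    {
        __sync_fetch_and_add(&NDirtyBuffers, 1);
    }

    /* Call after a dirty buffer has been written out and marked clean */
    static void
    CountBufferCleaned(void)
    {
        __sync_fetch_and_sub(&NDirtyBuffers, 1);
    }

    int
    main(void)
    {
        CountBufferDirtied();
        CountBufferDirtied();
        CountBufferCleaned();

        /* The bgwriter would read this without taking any lock; a slightly
         * stale value is fine for deciding how much work to do each round */
        printf("dirty buffers: %lld\n", (long long) NDirtyBuffers);
        return 0;
    }

Whether something like that stays cheap under heavy contention is exactly the part I don't trust my gut on, which is why I'd rather hear from someone who knows the buffer header locking better.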

The main problem I've been seeing is also long waits stuck behind a slow fsync on Linux. What I've been moving toward testing is an approach slightly different from the proposals here. What if all the database page writes (background writer, buffer eviction, or checkpoint scan) were counted, and periodic fsync requests were sent to the bgwriter based on that? For example, when I know I have a battery-backed caching controller that will buffer 64MB worth of data for me, if I forced an fsync after every 6000 8K writes (roughly 48MB), no single fsync would get stuck waiting for the disk for longer than I'd like.
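
As a toy illustration of the accounting I mean (standalone C with made-up names, nothing lifted from the backend; 6000 8K pages is that roughly-48MB figure, comfortably under the 64MB cache):

    #include <unistd.h>

    #define WRITES_BEFORE_SYNC 6000     /* ~48MB of 8K pages, under the 64MB cache */

    static int writes_since_sync = 0;   /* hypothetical counter, per file or global */

    /* Wrap every page write; once enough pages have gone out, push them to
     * disk so no later fsync has a mountain of dirty data queued behind it */
    static ssize_t
    write_page_counted(int fd, const void *page, size_t len)
    {
        ssize_t rc = write(fd, page, len);

        if (rc >= 0 && ++writes_since_sync >= WRITES_BEFORE_SYNC)
        {
            fsync(fd);                  /* small, bounded sync instead of one huge one */
            writes_since_sync = 0;
        }
        return rc;
    }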

Give the admin a max_writes_before_sync parameter, make a default of 0 behave exactly like today, and off you go: a simple tunable that doesn't require a complicated scheme to implement or break anybody's existing setup. Combined with a properly tuned background writer, that would solve the issues I've been running into. It would even make the problem of Linux accumulating too many cached writes until checkpoint time go away (I know how to eliminate that by adjusting the kernel's caching policy, but I have to be root to do it; a DBA should be able to work around that issue even if they don't have access to the kernel tunables).
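
In postgresql.conf terms I'm picturing nothing fancier than this (the parameter name is of course just a placeholder):

    # 0 (the default) keeps today's behavior: no extra fsync requests at all
    max_writes_before_sync = 0

    # With a 64MB battery-backed controller cache, sync every ~48MB of pages
    #max_writes_before_sync = 6000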

While I'm all for testing to prove me wrong, my gut feel is that going all the way to sync writes a la Oracle is a doomed approach, particularly on low-end hardware where they're super expensive. Following The Oracle Way is a good roadmap for a lot of things, but I wouldn't put building a lean enough database to run on modest hardware on that list. You can do sync writes with perfectly good performance on systems with a good battery-backed cache, but I think you'll get creamed in comparisons against MySQL on IDE disks if you start walking down that path; since right now a fair comparison with similar logging behavior is an even match there, that's a step backwards.

Also on the topic of sync writes to the database proper: wouldn't using O_DIRECT for those potentially be counter-productive? I was under the impression that one of the behaviors Postgres counts on is that data evicted from its buffer cache, eventually intended for writing to disk, is still kept around for a bit in the OS buffer cache. A subsequent read, because the data was needed again, might find it already in the OS buffer and therefore avoid an actual disk read; that substantially reduces the typical penalty for the database engine making a bad choice about what to evict. I fear a move to direct writes would put more pressure on the LRU implementation to be very smart, and that's code you really don't want to make more complicated.
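
Just to spell out what direct I/O asks of the caller (a standalone Linux example, nothing to do with the backend's actual I/O paths): with O_DIRECT the kernel keeps no copy of the page, and on Linux you also have to hand it aligned buffers, so an evicted page written this way really is gone from memory and the next read of it is a genuine disk read.

    #define _GNU_SOURCE             /* O_DIRECT is a GNU/Linux extension */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        void   *page = NULL;
        int     fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);

        /* O_DIRECT requires the buffer address, length, and file offset to
         * be suitably aligned; one 8K page aligned to 8K is safe */
        if (fd < 0 || posix_memalign(&page, 8192, 8192) != 0)
            return 1;
        memset(page, 0, 8192);

        /* Goes straight to the device, bypassing the OS buffer cache, so a
         * later read of this page cannot be satisfied from kernel memory */
        write(fd, page, 8192);

        close(fd);
        free(page);
        return 0;
    }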

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
