On Wed, 20 Dec 2006, Inaam Rana wrote:
> Talking of bgwriter_* parameters I think we are missing a crucial
> internal counter i.e. number of dirty pages. How much work bgwriter has
> to do at each wakeup call should be a function of total buffers and
> currently dirty buffers.
This is actually a question I'd been meaning to throw out myself to this
list. How hard would it be to add an internal counter to the buffer
management scheme that kept track of the current number of dirty pages?
I've been looking at the bufmgr code lately trying to figure out how to
insert one as part of building an auto-tuning bgwriter, but it's unclear
to me how I'd lock such a resource properly and scalably. I have a
feeling I'd be inserting a single-process locking bottleneck into that
code with any of the naive implementations I considered.
The main problem I've been seeing is also long waits stuck behind a slow
fsync on Linux. What I've been moving toward testing is an approach
slightly different from the proposals here. What if all the database page
writes (background writer, buffer eviction, or checkpoint scan) were
counted and periodic fsync requests sent to the bgwriter based on that?
For example, when I know I have a battery-backed caching controller that
will buffer 64MB worth of data for me, if I forced a fsync after every
6000 8K writes, no single fsync would get stuck waiting for the disk to
write for longer than I'd like.
Give the admin a max_writes_before_sync parameter, make the default of 0
work just like the current behavior, and off you go; a simple tunable that
doesn't require a complicated scheme to implement or break anybody's
existing setup. Combined with a properly tuned background writer, that
would solve the issues I've been running into. It would even make the
problem of Linux caching too many writes until checkpoint time go away (I
know how to eliminate that by adjusting caching policy, but I have to be
root to do it; a DBA should be able to work around that issue even if they
don't have access to the kernel tunables.)
While I'm all for testing to prove me wrong, my gut feel is that going all
the way to sync writes a la Oracle is a doomed approach, particularly on
low-end hardware where they're super expensive. Following The Oracle Way
is a good roadmap for a lot of things, but I wouldn't put building a lean
enough database to run on modest hardware on that list. You can do sync
writes with perfectly good performance on systems with a good
battery-backed cache, but I think you'll get creamed in comparisons
against MySQL on IDE disks if you start walking down that path; since
right now a fair comparison with similar logging behavior is an even match
there, that's a step backwards.
Also on the topic of sync writes to the database proper: wouldn't using
O_DIRECT for those be potentially counter-productive? I was under the
impression that one of the behaviors counted on by Postgres was that data
evicted from its buffer cache, eventually intended for writing to disk,
was still kept around for a bit in the OS buffer cache. A subsequent read
because the data was needed again might find the data already in the OS
buffer, therefore avoiding an actual disk read; that substantially reduces
the typical penalty for the database engine making a bad choice on what to
evict. I fear a move to direct writes would put more pressure on the LRU
implementation to be very smart, and that's code that you really don't
want to be more complicated.
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD