From: "Greg Smith" <[EMAIL PROTECTED]>
> This is actually a question I'd been meaning to throw out myself to
> list. How hard would it be to add an internal counter to the buffer
> management scheme that kept track of the current number of dirty
> I've been looking at the bufmgr code lately trying to figure out how
> insert one as part of building an auto-tuning bgwriter, but it's
> to me how I'd lock such a resource properly and scalably. I have a
> feeling I'd be inserting a single-process locking bottleneck into
> code with any of the naive implementations I considered.
To put it in an extreme way, how about making bgwriter count the dirty
buffers periodically scanning all the buffers? Do you know the book
"Principles of Transaction Processing"? Jim Gray was one of the
reviewers of this book.
In chapter 8, the author describes fuzzy checkpoint combined with
two-checkpoint approach. In his explanation, recovery manager (which
would be bgwriter in PostgreSQL) scans the buffers and records the
list of dirty buffers at each checkpoint. This won't need any locking
in PostgreSQL if I understand correctly. Then, the recovery manager
performs the next checkpoint after writing those dirty buffers. In
two-checkpoint approach, crash recovery starts redoing from the second
to last checkpoint. Two-checkpoint is described in Jim Gray's book,
too. But they don't refer to how the recovery manager tunes the speed
> slightly different from the proposals here. What if all the
> writes (background writer, buffer eviction, or checkpoint scan) were
> counted and periodic fsync requests send to the bgwriter based on
> For example, when I know I have a battery-backed caching controller
> will buffer 64MB worth of data for me, if I forced a fsync after
> 6000 8K writes, no single fsync would get stuck waiting for the disk
> write for longer than I'd like.
That seems interesting.
> You can do sync
> writes with perfectly good performance on systems with a good
> battery-backed cache, but I think you'll get creamed in comparisons
> against MySQL on IDE disks if you start walking down that path;
> right now a fair comparison with similar logging behavior is an even
> there, that's a step backwards.
I wonder what characteristics SATA disks have compared to IDE. Recent
PCs are equiped with SATA disks, aren't they?
What do you feel your approach compares to MySQL on IDE disks?
> Also on the topic of sync writes to the database proper: wouldn't
> O_DIRECT for those potentially counter-productive? I was under the
> impressions that one of the behaviors counted on by Postgres was
> evicted from its buffer cache, eventually intended for writing to
> was still kept around for a bit in the OS buffer cache. A
> because the data was needed again might find the data already in the
> buffer, therefore avoiding an actual disk read; that substantially
> the typical penalty for the database engine making a bad choice on
> evict. I fear a move to direct writes would put more pressure on
> implementation to be very smart, and that's code that you really
> want to be more complicated.
I'm worried about this, too.
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not