On Thu, 6 Sep 2007, Kevin Grittner wrote:
> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would allay all of my concerns about this patch. Basically, our problems were resolved by getting all dirty buffers out to the OS cache within two seconds
Unfortunately it wouldn't make my concerns about your system go away or I'd have recommended exposing it specifically to address your situation. I have been staring carefully at your configuration recently, and I would wager that you could turn off the LRU writer altogether and still meet your requirements in 8.2. Here's what you've got right now:
shared_buffers = 160MB (=20000 buffers)
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
With the default delay of 200ms, this has the LRU-writer scanning the whole pool every second, while the all-writer scans it every two seconds, assuming they don't hit the write limits. If some event were to dirty the whole pool in 200ms, it might take as much as 6.7 seconds to write everything out (20000 / 600 * 200ms) via the all-scan. The all-scan is already gone in 8.3. Your LRU scan will take much longer than that to clear everything out: at least 20 seconds (20000 / 200 * 200ms) for a fully dirty cache.
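For anyone who wants to check those figures, here is a rough sketch of the arithmetic in Python. The function and variable names are made up for illustration; only the settings and the 200ms delay come from the configuration above, and the model ignores everything except the percent and maxpages limits.

```python
# Back-of-the-envelope timing for the 8.2 background writer settings above.
# Helper names are illustrative only; they are not PostgreSQL identifiers.

def full_scan_seconds(percent_per_round, delay_ms=200):
    """Time to scan the whole pool when each round covers percent_per_round."""
    rounds = 100.0 / percent_per_round
    return rounds * delay_ms / 1000.0

def full_write_seconds(total_buffers, maxpages_per_round, delay_ms=200):
    """Time to write every buffer when each round is capped at maxpages."""
    rounds = total_buffers / maxpages_per_round
    return rounds * delay_ms / 1000.0

TOTAL_BUFFERS = 20000  # shared_buffers = 160MB at 8kB per buffer

lru_scan = full_scan_seconds(20.0)                   # bgwriter_lru_percent
all_scan = full_scan_seconds(10.0)                   # bgwriter_all_percent
all_write = full_write_seconds(TOTAL_BUFFERS, 600)   # bgwriter_all_maxpages
lru_write = full_write_seconds(TOTAL_BUFFERS, 200)   # bgwriter_lru_maxpages

print(f"LRU scan of whole pool:   {lru_scan:.1f} s")
print(f"all-scan of whole pool:   {all_scan:.1f} s")
print(f"all-scan write-out bound: {all_write:.1f} s")
print(f"LRU write-out bound:      {lru_write:.1f} s")
```

The two write-out numbers are best cases for a fully dirty pool, since they assume every round actually writes its full maxpages allowance.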
But in fact, it's impossible to even bound how long it will take before the LRU writer (which is the only part this new patch tries to improve) gets around to writing even a single dirty buffer no matter what bgwriter_lru_percent (8.2) or scan_whole_pool_seconds (JIT patch) is set to.
There's a second low-level issue involved here. When a page becomes dirty, that implies it was also recently used, which means the LRU writer won't touch it. That page can't be written out by the LRU writer until an entire pass has been made over the shared_buffer pool while looking for buffers to allocate for new activity. When the allocation clock-sweep passes over the newly dirtied buffer again, its usage count will drop by one and it will no longer be considered recently used. At that point the LRU writer can write it out. So unless there is other allocation activity going on, the scan_whole_pool_seconds mechanism will never provide the bound on time to scan and write everything you hope it will.
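A toy model may make the usage count point easier to see. This is only a sketch of the behavior described above, not the actual buffer manager code: the Buffer class and both functions are invented for illustration, the writer is allowed to look at the entire pool on every pass (more generous than any real setting), and the allocation sweep is simplified to one full pass that decrements every nonzero usage count.

```python
# Toy model: the LRU writer skips recently used buffers, and only the
# allocation clock sweep lowers usage counts. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Buffer:
    dirty: bool = False
    usage_count: int = 0

def lru_writer_pass(pool):
    """Write out dirty buffers that are not recently used; return how many."""
    written = 0
    for buf in pool:
        if buf.dirty and buf.usage_count == 0:
            buf.dirty = False
            written += 1
    return written

def allocation_sweep(pool):
    """Simplified full clock-sweep pass: drop each nonzero usage count by one."""
    for buf in pool:
        if buf.usage_count > 0:
            buf.usage_count -= 1

# Every buffer was just dirtied, so every buffer is also recently used.
pool = [Buffer(dirty=True, usage_count=1) for _ in range(8)]

# With no allocation activity, any number of writer passes accomplishes nothing.
idle_written = sum(lru_writer_pass(pool) for _ in range(100))

# Once allocations have swept the pool, the same buffers become writable.
allocation_sweep(pool)
after_sweep_written = lru_writer_pass(pool)

print("written while idle:", idle_written)
print("written after one allocation sweep:", after_sweep_written)
```

In this model, how often the writer runs makes no difference until something drives the allocation sweep around the pool.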
And if there are other allocations going on, the much more powerful JIT mechanism will scan the whole pool plenty fast if you bump the already exposed multiplier tunable up. In my tests where the buffer cache was filled with mostly dirty buffers that couldn't be re-used (something relatively easy to trigger with pgbench tests), I've actually watched the new code scan >90% of the buffer cache looking for those few reusable buffers in the pool in a single invocation. This would be like setting bgwriter_lru_percent=90.0 in the old configuration, but it only gets that aggressive when the distribution of pages in the buffer cache demands it, and when it has reason to believe going that fast will be helpful.
The completely understandable line of thinking that led to your request here is one of my concerns with exposing scan_whole_pool_seconds as a tunable. It may suggest to people that if they set the number very low, it will assure all dirty buffers will be scanned and written within that time bound. That's certainly not the case; both the maxpages and the usage count information will actually drive the speed that mechanism plods through the buffer cache. It really isn't useful for scanning fast.
-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD