Robert Haas wrote:
Well, why can't they just hang out as dirty buffers in the OS cache,
which is also designed to solve this problem?

If the OS were guaranteed to be as suitable for this purpose as the approach taken in the database, that might work. But much as the clock-sweep approach should outperform a simpler OS caching implementation on many common workloads, there are a few spots where making dirty writes the OS's problem can fall down:

1) It presumes that OS write coalescing will solve the problem for you by merging repeated writes to the same block, which, depending on the implementation, it may not.

2) On some filesystems, such as ext3, any write with an fsync behind it will flush the whole write cache out and defeat this optimization. Since the spread checkpoint design has some such writes going to the data disk in the middle of the currently processing checkpoint, in those situations that's likely to push the first write of a block to disk before it can be combined with a second. Had the block stayed in the database's buffer cache instead, it might survive as much as a full checkpoint cycle longer.

3) The "timeout" as it were for shared buffers is driven by the distance between checkpoints, typically as long as 5 minutes. The longest a filesystem will hold onto a write is probably less. On Linux it's typically 30 seconds before the OS considers a write important to get out to disk, longest case; if you've already filled a lot of RAM with writes it can be substantially less.

I guess the obvious question is whether Windows "doesn't need" more
shared memory than that, or whether it "can't effectively use" more
memory than that.

It's probably "can't effectively use". We know for a fact that applications where blocks regularly accumulate high usage counts and see repeated reads/writes to them, which includes pgbench, benefit in several easy-to-measure ways from a larger database buffer cache; there's just plain less churn of buffers going in and out of there. The alternate explanation, "Windows is just so much better at read/write caching that you should give it most of the RAM anyway", doesn't sound nearly as probable as the more commonly proposed theory, "Windows doesn't handle large blocks of shared memory well".
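To make the usage-count point concrete, here's a toy sketch of the clock-sweep idea. It mirrors what the buffer manager does (usage count capped at 5, decremented on each sweep pass, buffer reused once it hits zero), but the names and structure are my simplified inventions for illustration, not the server's actual C code:

    # Toy clock-sweep buffer pool: hot blocks survive eviction pressure.
    MAX_USAGE = 5  # PostgreSQL caps usage_count at 5 (BM_MAX_USAGE_COUNT)

    class BufferPool:
        def __init__(self, nbuffers):
            self.pages = [None] * nbuffers   # which disk block each slot holds
            self.usage = [0] * nbuffers      # per-slot usage count
            self.hand = 0                    # clock sweep position
            self.lookup = {}                 # block -> slot

        def access(self, block):
            """Touch a block, loading it via the clock sweep on a miss."""
            slot = self.lookup.get(block)
            if slot is not None:
                # Hit: bump the usage count so the sweep spares this buffer.
                self.usage[slot] = min(self.usage[slot] + 1, MAX_USAGE)
                return slot
            # Miss: sweep until a slot reaches usage 0, decrementing as we go.
            while True:
                if self.usage[self.hand] == 0:
                    victim, self.hand = self.hand, (self.hand + 1) % len(self.pages)
                    if self.pages[victim] is not None:
                        del self.lookup[self.pages[victim]]
                    self.pages[victim] = block
                    self.usage[victim] = 1
                    self.lookup[block] = victim
                    return victim
                self.usage[self.hand] -= 1
                self.hand = (self.hand + 1) % len(self.pages)

    pool = BufferPool(4)
    for b in [1, 1, 1, 2, 3, 4, 5, 1]:   # block 1 is "hot"
        pool.access(b)
    # Block 1 is still resident despite the pool churning through 2..5,
    # because its high usage count absorbed the sweep's decrements.
    assert 1 in pool.lookup

A plain recency-based OS cache gives no comparable protection to a block that's hit three times in a burst, which is one reason giving the database the memory pays off on these workloads.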

Note that there's no discussion of the why behind this in the commit you just did, just a description of what happens. The reasons why are left undefined, which I feel is appropriate given that we really don't know for sure. I'm still waiting for somebody to let loose the Visual Studio profiler and measure what's causing the degradation at larger sizes.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us

