Jeff Janes wrote:
And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percentage points), so recent kernels introduced the
dirty*_bytes parameters.  I like these better because they do what
they say.  With dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.
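
To put numbers on that granularity complaint, here's a quick back-of-the-envelope sketch; the 64 GB RAM and 512 MB controller cache figures below are illustrative assumptions, not measurements from this thread:

    ram_bytes = 64 * 1024**3          # hypothetical large-memory server
    raid_cache_bytes = 512 * 1024**2  # hypothetical battery-backed write cache

    # dirty_background_ratio only accepts whole percentage points,
    # so 1% is the smallest nonzero threshold you can ask for:
    min_dirty_target = ram_bytes // 100

    print(min_dirty_target // 1024**2)          # 655 MB of dirty data allowed
    print(min_dirty_target > raid_cache_bytes)  # True: more than the cache can absorb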

Right, you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that was the right size for it, seem to be defeated by other mechanisms within the kernel. The last time I looked at the related source code, the "congestion control" logic that kicks in to throttle writes seemed a likely suspect. This is why I'm not very optimistic that newer mechanisms like the dirty_background_bytes setting added in 2.6.29 will help here; that just provides a way to set lower values, while the same basic logic remains underneath.
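
If you do have a kernel with the bytes-based interface (2.6.29 or later), setting it is just a write to /proc. A minimal sketch, with the 64 MB value picked arbitrarily for illustration rather than as a tuning recommendation:

    # Needs root.  Setting dirty_background_bytes makes the kernel ignore
    # dirty_background_ratio and use this absolute threshold instead.
    DIRTY_BG_BYTES = 64 * 1024 * 1024  # arbitrary example value

    with open("/proc/sys/vm/dirty_background_bytes", "w") as f:
        f.write(str(DIRTY_BG_BYTES))

Per the above, though, I wouldn't expect even a small value here to fully fix things, since the same congestion logic is still underneath.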

Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that a sysadmin on a particular OS, configuration, and
filesystem can avoid having to change the kernel parameters away from
their defaults?  (Assuming, of course, that I am correctly
understanding the problem; always a dangerous assumption.)

I've been trying to make this problem go away using just the available kernel tunables since 2006. On the server whose problems motivated the submitted patch, I adjusted them carefully months before the issue got bad, and it didn't help. Maybe if that system had been running a later kernel that supported dirty_background_bytes, this would have worked better. Over the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3. I no longer expect that the kernel developers will ever make this problem go away given the way checkpoints are written out right now, whereas the last round of PostgreSQL work in this area definitely helped.
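
For anyone who hasn't looked at it, the 8.3 logic paces the buffer writes across the checkpoint interval, aiming to finish them at checkpoint_completion_target (default 0.5) of the way through. A toy Python model of just the pacing idea, not the actual server code; write_buffer here is a hypothetical stand-in:

    import time

    def write_buffer(buf):
        # Stand-in for pushing one dirty buffer out to the OS.
        pass

    def spread_checkpoint_writes(dirty_buffers, checkpoint_timeout_s,
                                 completion_target=0.5):
        budget = checkpoint_timeout_s * completion_target
        start = time.monotonic()
        for i, buf in enumerate(dirty_buffers, 1):
            write_buffer(buf)
            # If we're ahead of the schedule implied by the budget, nap
            # so the writes trickle out instead of arriving in a burst.
            on_schedule_at = start + budget * i / len(dirty_buffers)
            delay = on_schedule_at - time.monotonic()
            if delay > 0:
                time.sleep(delay)

(The real scheduling is more involved; it also tracks progress against WAL consumption, not just elapsed time.)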

The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, then by the time the syncs execute, enough of the data should already have gone out that they don't take very long to process. That was usually true over the last few years on systems with a battery-backed write cache, where the amount of dirty data cached by the OS was small relative to the size of the RAID cache. That's not the case anymore, and the divergence is growing.
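
In rough form, that premise looks like this; a minimal sketch with hypothetical file descriptors, not the server's actual storage-manager code:

    import os

    def naive_checkpoint(dirty_pages, fds):
        # Phase 1: write every dirty buffer into the OS page cache early.
        for fd, offset, page in dirty_pages:
            os.pwrite(fd, page, offset)

        # Phase 2: sync each file, assuming the writeback mostly already
        # happened.  Once the OS is caching far more dirty data than the
        # controller's write cache can absorb, these fsync calls stall
        # instead of returning quickly.
        for fd in fds:
            os.fsync(fd)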

The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation.
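
Which suggests the sync phase needs the same kind of spreading the write phase got. A sketch of the general shape only; the pause length and cleanup hook are hypothetical, and this isn't the submitted patch:

    import os
    import time

    def spread_syncs(fds, pause_s, do_cleanup):
        # Instead of fsync()ing everything back to back, yield between
        # syncs so the usual cleanup work isn't starved.
        for fd in fds:
            os.fsync(fd)
            do_cleanup()         # hypothetical hook for the normal
                                 # background writer cleanup work
            time.sleep(pause_s)  # let the OS drain writeback between syncs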

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

