On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
> > Jan Kara <j...@suse.cz> wrote:
> >> Just to get some idea about the sizes - how large are the
> >> checkpoints we are talking about that cause IO stalls?
> > Big.
> To quantify that, in a production setting we were seeing pauses of
> up to two minutes with shared_buffers set to 8GB and default dirty
> page settings for Linux, on a machine with 256GB RAM and 512MB
> non-volatile cache on the RAID controller.

There's your problem.

By default, background writeback doesn't start until 10% of memory
is dirtied, and on your machine that's 25GB of RAM. That's way too
high for your workload.

It appears to me that we are seeing large memory machines much more
commonly in data centers - a couple of years ago 256GB RAM was only
seen in supercomputers. Hence machines of this size are moving from
the "tweaking settings for supercomputers is OK" class to the
"tweaking settings for enterprise servers is not OK" class....

Perhaps what we need to do is deprecate dirty_ratio and
dirty_background_ratio as the defaults, move to the byte-based
values as the defaults, and cap those appropriately - e.g. 10/20%
of RAM for small machines down to a couple of GB for large
machines.

> To eliminate stalls we
> had to drop shared_buffers to 2GB (to limit how many dirty pages
> could build up out-of-sight from the OS), spread checkpoints to 90%
> of allowed time (almost no gap between finishing one checkpoint and
> starting the next) and crank up the background writer so that no
> dirty page sat unwritten in PostgreSQL shared_buffers for more than
> 4 seconds. Less aggressive pushing to the OS resulted in the
> avalanche of writes I previously described, with the corresponding
> I/O stalls. We approached that incrementally, and that's the point
> where stalls stopped occurring. We did not adjust the OS
> thresholds for writing dirty pages, although I know of others who
> have had to do so.
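[Editor's note: the tuning Kevin describes above roughly corresponds to a
postgresql.conf fragment like the following. shared_buffers and the 90%
checkpoint spread are from the quoted paragraph; the bgwriter values are
illustrative guesses at "no dirty page sits unwritten for more than 4
seconds", not settings taken from the thread.]

```
shared_buffers = 2GB                # dropped from 8GB to limit dirty pages
                                    # hidden from the OS
checkpoint_completion_target = 0.9  # spread writes over 90% of the
                                    # checkpoint interval
bgwriter_delay = 200ms              # illustrative; short enough that the
bgwriter_lru_maxpages = 1000        # bgwriter sweeps shared_buffers within
bgwriter_lru_multiplier = 4.0       # a few seconds
```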
Essentially, changing dirty_background_bytes, dirty_bytes and
dirty_expire_centisecs to be much smaller should make the kernel
start writeback much sooner, and so you shouldn't have to limit the
amount of dirty buffers the application holds in order to prevent
major fsync-triggered stalls.
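[Editor's note: the effect Dave describes can be sketched numerically. This
is a toy model - the background_threshold helper and the 2GB cap value are
illustrative assumptions, not kernel code.]

```python
# Toy model of when background writeback starts, per the thread above.
GIB = 1024**3

def background_threshold(ram_bytes, ratio_pct=10, cap_bytes=None):
    """Dirty-page level at which background writeback kicks in.

    ratio_pct mirrors the vm.dirty_background_ratio default (10% of RAM);
    cap_bytes sketches the byte-based cap proposed in the thread.
    """
    threshold = ram_bytes * ratio_pct // 100
    if cap_bytes is not None:
        threshold = min(threshold, cap_bytes)
    return threshold

# Default behaviour on the 256GB machine from the thread: ~25GB of dirty
# data accumulates before background writeback even starts.
print(background_threshold(256 * GIB) // GIB)                     # -> 25

# With a hypothetical 2GB byte-based cap, writeback starts far sooner.
print(background_threshold(256 * GIB, cap_bytes=2 * GIB) // GIB)  # -> 2
```

In practice the cap corresponds to setting vm.dirty_background_bytes and
vm.dirty_bytes via sysctl instead of relying on the ratio defaults.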
Sent via pgsql-hackers mailing list (email@example.com)