Server specs:
* Dell R610, dual E5645 hex-core 2.4GHz, 192GB RAM
* RAID 1: 2x 400GB SSD (OS + WAL logs)
* RAID 10: 4x 400GB SSD (/var/lib/pgsql)
/etc/sysctl.conf:

kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
vm.overcommit_memory = 0
vm.swappiness = 0
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 536870912

postgresql.conf:

listen_addresses = '*'                # what IP address(es) to listen on;
max_connections = 150                 # (change requires restart)
shared_buffers = 48GB                 # min 128kB
work_mem = 1310MB                     # min 64kB
maintenance_work_mem = 24GB           # min 1MB
wal_level = hot_standby               # minimal, archive, or hot_standby
checkpoint_segments = 64              # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min            # range 30s-1h
checkpoint_completion_target = 0.5    # checkpoint target duration, 0.0 - 1.0
max_wal_senders = 5                   # max number of walsender processes
wal_keep_segments = 2000              # in logfile segments, 16MB each; 0 disables
hot_standby = on                      # "on" allows queries during recovery
max_standby_archive_delay = 120s      # max delay before canceling queries
max_standby_streaming_delay = 120s    # max delay before canceling queries
effective_cache_size = 120GB
constraint_exclusion = partition      # on, off, or partition
log_destination = 'syslog'            # Valid values are combinations of
logging_collector = on                # Enable capturing of stderr and csvlog
log_directory = 'pg_log'              # directory where log files are written,
log_filename = 'postgresql-%a.log'    # log file name pattern,
log_truncate_on_rotation = on         # If on, an existing log file with the
log_rotation_age = 1d                 # Automatic rotation of logfiles will
log_rotation_size = 0                 # Automatic rotation of logfiles will
log_min_duration_statement = 500      # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_line_prefix = 'user=%u db=%d remote=%r '    # special values:
log_lock_waits = on                   # log lock waits >= deadlock_timeout
autovacuum = on                       # Enable autovacuum subprocess? 'on'
log_autovacuum_min_duration = 0       # -1 disables, 0 logs all actions and
autovacuum_max_workers = 5            # max number of autovacuum subprocesses
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'           # locale for system error message
lc_monetary = 'en_US.UTF-8'           # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'            # locale for number formatting
lc_time = 'en_US.UTF-8'               # locale for time formatting
default_text_search_config = 'pg_catalog.english'
deadlock_timeout = 300ms

Per pgtune:

#------------------------------------------------------------------------------
# pgtune wizard run on 2013-02-05
# Based on 198333224 KB RAM in the server
#------------------------------------------------------------------------------
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 128GB
work_mem = 1152MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 44GB
max_connections = 80

We use pgbouncer (set to 140 connections) in transaction pooling mode in front of our db.

The problem:

For the most part, the server hums along. No other applications run on this server besides postgres. Load averages rarely break 2.0, it never swaps, and %iowait is usually not more than 0.12.

But periodically there are spikes in our app's db response time. Normally, the app's db response time hovers around 100ms for most of the day. During a spike, it can go up to 1000ms or 1500ms, and the number of pg connections goes to 140 (maxed out at pgbouncer's limit, where normally it's only about 20-40 connections).
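To catch the next spike in the act, we're planning to sample what every backend is doing once a second, roughly like this (a minimal sketch: the log path and psql connection flags are placeholders for our setup, and the column names assume 9.2+; on 9.1 and earlier it would be procpid/current_query instead of pid/query):

# sample non-idle backends once per second; leave running in screen
# until a spike hits (output path is just an example)
while true; do
    psql -X -At -c "
        SELECT now(), pid, waiting, state, left(query, 80)
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY query_start;
    " >> /tmp/pg_activity.log
    sleep 1
done

The hope is that if most of the 140 backends show waiting = true during a spike, that would point at locks; if they're all active while %system climbs, that would suggest something at the OS level instead.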
Also, during these spike windows, which usually last less than two minutes, we see several thousand queries in the pg log (this is with log_min_duration_statement = 500), compared to maybe one or two dozen 500ms+ queries in non-spike times. Between spikes there could be an hour, two hours, sometimes half a day. There doesn't appear to be any pattern that we can see:

* there are no obvious queries that are locking the db
* it doesn't necessarily happen during high-traffic times, though it can
* it doesn't happen during any known regularly-scheduled system, db, or app job, including crons
* in general, there's no discernible regularity to it at all
* it doesn't coincide with checkpoint starts or completions
* it doesn't coincide with autovacuums
* there are no messages in any system logs that might indicate a system or hardware issue

Besides spikes in our graphs, the only other visible effect is that %system in sar goes from an average of 0.7 to as high as 10.0 or so (%iowait and all other sar variables remain the same).

According to our monitoring system, web requests get queued up, and our alerting system sometimes reports either a timeout or multiple web response times greater than 300ms, so we suspect (but have no proof) that some users see either a long hang or possibly a timeout. But since a spike almost always lasts less than two minutes, and sometimes less than one, we don't really hear any complaints (guessing that most people hit reload, things work again, and they continue on), and we haven't been able to observe any negative effect ourselves. Still, we want to get in front of the problem, in case it's something that will get worse as traffic continues to grow.

We've tweaked various configs on both the OS side and the postgresql.conf side. What's posted above is our current setup, and the problem persists. Any ideas as to where we could even look?

Also, whether related or unrelated to the spikes, are there any recommendations for our postgresql.conf or sysctl.conf based on our hardware? From pgtune's output, I am lowering maintenance_work_mem from 24GB down to maybe 2GB, but I keep reading conflicting things about other settings, such as checkpoints or max_connections.

johnny
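P.S. For concreteness, the one pgtune-driven change I've settled on so far, as it would appear in postgresql.conf (2GB is my working guess, splitting the difference between pgtune's 1GB suggestion and our current 24GB, not a tested value):

maintenance_work_mem = 2GB    # was 24GB; pgtune suggested 1GB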