Just out of curiosity, are you using transparent huge pages?
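On most Linux boxes the current THP mode is visible in sysfs (the exact path varies by distro; RHEL 6 kernels, for example, use a redhat_transparent_hugepage directory instead), so purely as a sketch:

    # show the current THP mode; the value in brackets is the active one
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag
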
On Feb 5, 2013 5:03 PM, "Johnny Tan" <johnnyd...@gmail.com> wrote:
> Server specs:
> Dell R610
> dual E5645 hex-core 2.4GHz
> 192GB RAM
> RAID 1: 2x400GB SSD (OS + WAL logs)
> RAID 10: 4x400GB SSD (/var/lib/pgsql)
>
> /etc/sysctl.conf:
> kernel.msgmnb = 65536
> kernel.msgmax = 65536
> kernel.shmmax = 68719476736
> kernel.shmall = 4294967296
> vm.overcommit_memory = 0
> vm.swappiness = 0
> vm.dirty_background_bytes = 536870912
> vm.dirty_bytes = 536870912
>
> postgresql.conf:
> listen_addresses = '*' # what IP address(es) to listen on;
> max_connections = 150 # (change requires restart)
> shared_buffers = 48GB # min 128kB
> work_mem = 1310MB # min 64kB
> maintenance_work_mem = 24GB # min 1MB
> wal_level = hot_standby # minimal, archive, or hot_standby
> checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 30min # range 30s-1h
> checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
> max_wal_senders = 5 # max number of walsender processes
> wal_keep_segments = 2000 # in logfile segments, 16MB each; 0 disables
> hot_standby = on # "on" allows queries during recovery
> max_standby_archive_delay = 120s # max delay before canceling queries
> max_standby_streaming_delay = 120s # max delay before canceling queries
> effective_cache_size = 120GB
> constraint_exclusion = partition # on, off, or partition
> log_destination = 'syslog' # Valid values are combinations of
> logging_collector = on # Enable capturing of stderr and csvlog
> log_directory = 'pg_log' # directory where log files are written,
> log_filename = 'postgresql-%a.log' # log file name pattern,
> log_truncate_on_rotation = on # If on, an existing log file with the
> log_rotation_age = 1d # Automatic rotation of logfiles will
> log_rotation_size = 0 # Automatic rotation of logfiles will
> log_min_duration_statement = 500 # -1 is disabled, 0 logs all statements
> log_checkpoints = on
> log_line_prefix = 'user=%u db=%d remote=%r ' # special values:
> log_lock_waits = on # log lock waits >= deadlock_timeout
> autovacuum = on # Enable autovacuum subprocess? 'on'
> log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
> autovacuum_max_workers = 5 # max number of autovacuum subprocesses
> datestyle = 'iso, mdy'
> lc_messages = 'en_US.UTF-8' # locale for system error message
> lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
> lc_numeric = 'en_US.UTF-8' # locale for number formatting
> lc_time = 'en_US.UTF-8' # locale for time formatting
> default_text_search_config = 'pg_catalog.english'
> deadlock_timeout = 300ms
>
> per pgtune:
>
> #------------------------------------------------------------------------------
> # pgtune wizard run on 2013-02-05
> # Based on 198333224 KB RAM in the server
> #------------------------------------------------------------------------------
> default_statistics_target = 100
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.9
> effective_cache_size = 128GB
> work_mem = 1152MB
> wal_buffers = 8MB
> checkpoint_segments = 16
> shared_buffers = 44GB
> max_connections = 80
>
> We use pgbouncer (set to 140 connections) in transaction pooling mode in
> front of our db.
>
> The problem:
>
> For the most part, the server hums along. No other applications run on
> this server other than postgres. Load averages rarely break 2.0, it never
> swaps, and %iowait is usually not more than 0.12.
>
> But periodically, there are spikes in our app's db response time.
> Normally, the app's db response time hovers in the 100ms range for most
> of the day. During the spike times, it can go up to 1000ms or 1500ms, and
> the number of pg connections goes to 140 (maxed out at pgbouncer's limit,
> where normally it's only about 20-40 connections). Also, during these
> times, which usually last less than 2 minutes, we will see several
> thousand queries in the pg log (this is with log_min_duration_statement =
> 500), compared to maybe one or two dozen 500ms+ queries in non-spike
> times.
>
> In between spikes could be an hour, two hours, sometimes half a day.
> There doesn't appear to be any pattern that we can see:
> * there are no obvious queries that are locking the db
> * it doesn't necessarily happen during high-traffic times, though it can
> * it doesn't happen during any known regularly-scheduled system, db, or
>   app job, including crons
> * in general, there's no discernible regularity to it at all
> * it doesn't coincide with checkpoint starts or completions
> * it doesn't coincide with autovacuums
> * there are no messages in any system logs that might indicate any
>   system or hardware-related issue
>
> Besides spikes in our graphs, the only other visible effect is that
> %system in sar goes from an average of 0.7 to as high as 10.0 or so
> (%iowait and all other sar variables remain the same).
>
> And according to our monitoring system, web requests get queued up, and
> our alerting system sometimes either says there's a timeout or that it
> had multiple web response times greater than 300ms, and so we suspect
> (but have no proof) that some users will see either a long hang or
> possibly a timeout. But since it's almost always less than two minutes,
> and sometimes less than one, we don't really hear any complaints
> (guessing that most people hit reload, and things work again, so they
> continue on), and we haven't been able to see any negative effect
> ourselves.
>
> But we want to get in front of the problem, in case it is something that
> will get worse as traffic continues to grow. We've tweaked various
> configs on the OS side as well as the postgresql.conf side. What's posted
> above is our current setup, and the problem persists.
>
> Any ideas as to where we could even look?
>
> Also, whether related or unrelated to the spikes, are there any
> recommendations for our postgresql.conf or sysctl.conf based on our
> hardware? From pgtune's output, I am lowering maintenance_work_mem from
> 24GB down to maybe 2GB, but I keep reading conflicting things about other
> settings, such as checkpoints or max_connections.
>
> johnny
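
And if THP does turn out to be enabled, one fairly low-risk experiment (again assuming the stock sysfs path; adjust it for RHEL-style kernels) is to switch it off temporarily and watch whether the %system spikes go away:

    # disable THP and its defrag pass until the next reboot
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

Writing "always" back to the same files re-enables it, so nothing here is permanent.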