David,

Interesting observations. I hadn't been tracking the interrupts, but perhaps I should take a look. How are you measuring them over a period of time, or are you just getting them in real time?
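For reference, a few ways to sample the interrupt rate over time rather than just at a single instant (a rough sketch; sar needs the sysstat package, and the /tmp file names are just examples):

    # "in" column = interrupts/sec, sampled every 5 seconds
    vmstat 5

    # total interrupt rate, one sample per minute (sysstat)
    sar -I SUM 60

    # or diff the raw per-IRQ counters to see which sources are climbing
    cat /proc/interrupts > /tmp/irq.1; sleep 60; cat /proc/interrupts > /tmp/irq.2
    diff /tmp/irq.1 /tmp/irq.2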
Did you turn off THP altogether, or just THP defrag?

On Wed, Feb 6, 2013 at 10:42 AM, David Whittaker <d...@iradix.com> wrote:

> Josh/Johnny,
>
> We've been seeing a similar problem as well, and had also figured THP was
> involved. We found this in syslog:
> https://gist.github.com/davewhittaker/4723285, which led us to disable
> THP 2 days ago. At first the results seemed good. In particular, our
> issues always seemed interrupt-related, and the average interrupts/sec
> immediately dropped from 7k to around 3k after restarting. The good news
> is that we didn't see any spike in system CPU time yesterday. The bad
> news is that we did see a spike in app latency that originated from the
> DB, but now the spike is in user CPU time and seems to be spread across
> all of the running postgres processes. Interrupts still blew up to
> 21k/sec when it happened. We are still diagnosing, but I'd be curious to
> see if either of you get similar results from turning THP off.
>
>
> On Tue, Feb 5, 2013 at 11:23 PM, Josh Krupka <jkru...@gmail.com> wrote:
>
>> I've been looking into something on our system that sounds similar to
>> what you're seeing. I'm still researching it, but I suspect the memory
>> compaction that runs as part of transparent huge pages when memory is
>> allocated... yet to be proven. The tunable you mentioned controls the
>> compaction process that runs at allocation time so it can try to
>> allocate large pages; there's a separate one that controls whether the
>> compaction is done in khugepaged, and a separate one that controls
>> whether THP is used at all (/sys/kernel/mm/transparent_hugepage/enabled,
>> or perhaps different in your distro).
>>
>> What's the output of this command?
>> egrep 'trans|thp|compact_' /proc/vmstat
>>
>> compact_stall represents the number of processes that were stalled to
>> do a compaction; the other metrics have to do with other parts of THP.
>> If you see compact_stall climbing, from what I can tell those stalls
>> might be causing your spikes. I haven't found a way of telling how long
>> the processes have been stalled. You could probably get a little more
>> insight into the processes with some tracing, assuming you can catch it
>> quickly enough. Running perf top will also show the compaction
>> happening, but that doesn't necessarily mean it's impacting your running
>> processes.
>>
>>
>> On Tue, Feb 5, 2013 at 6:46 PM, Johnny Tan <johnnyd...@gmail.com> wrote:
>>
>>> # cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
>>> [always] never
>>>
>>>
>>> On Tue, Feb 5, 2013 at 5:37 PM, Josh Krupka <jkru...@gmail.com> wrote:
>>>
>>>> Just out of curiosity, are you using transparent huge pages?
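In case it's useful, all three THP knobs can be checked (and THP turned off entirely, not just defrag) from /sys. A sketch for a mainline kernel; on RHEL 6 the directory is /sys/kernel/mm/redhat_transparent_hugepage instead, and the echo settings do not survive a reboot:

    # the value in [brackets] is the active one
    cat /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/defrag
    cat /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

    # disable THP altogether
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag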
>>>> On Feb 5, 2013 5:03 PM, "Johnny Tan" <johnnyd...@gmail.com> wrote:
>>>>
>>>>> Server specs:
>>>>> Dell R610
>>>>> dual E5645 hex-core 2.4GHz
>>>>> 192GB RAM
>>>>> RAID 1: 2x400GB SSD (OS + WAL logs)
>>>>> RAID 10: 4x400GB SSD (/var/lib/pgsql)
>>>>>
>>>>>
>>>>> /etc/sysctl.conf:
>>>>> kernel.msgmnb = 65536
>>>>> kernel.msgmax = 65536
>>>>> kernel.shmmax = 68719476736
>>>>> kernel.shmall = 4294967296
>>>>> vm.overcommit_memory = 0
>>>>> vm.swappiness = 0
>>>>> vm.dirty_background_bytes = 536870912
>>>>> vm.dirty_bytes = 536870912
>>>>>
>>>>>
>>>>> postgresql.conf:
>>>>> listen_addresses = '*' # what IP address(es) to listen on;
>>>>> max_connections = 150 # (change requires restart)
>>>>> shared_buffers = 48GB # min 128kB
>>>>> work_mem = 1310MB # min 64kB
>>>>> maintenance_work_mem = 24GB # min 1MB
>>>>> wal_level = hot_standby # minimal, archive, or hot_standby
>>>>> checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
>>>>> checkpoint_timeout = 30min # range 30s-1h
>>>>> checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
>>>>> max_wal_senders = 5 # max number of walsender processes
>>>>> wal_keep_segments = 2000 # in logfile segments, 16MB each; 0 disables
>>>>> hot_standby = on # "on" allows queries during recovery
>>>>> max_standby_archive_delay = 120s # max delay before canceling queries
>>>>> max_standby_streaming_delay = 120s # max delay before canceling queries
>>>>> effective_cache_size = 120GB
>>>>> constraint_exclusion = partition # on, off, or partition
>>>>> log_destination = 'syslog' # Valid values are combinations of
>>>>> logging_collector = on # Enable capturing of stderr and csvlog
>>>>> log_directory = 'pg_log' # directory where log files are written,
>>>>> log_filename = 'postgresql-%a.log' # log file name pattern,
>>>>> log_truncate_on_rotation = on # If on, an existing log file with the
>>>>> log_rotation_age = 1d # Automatic rotation of logfiles will
>>>>> log_rotation_size = 0 # Automatic rotation of logfiles will
>>>>> log_min_duration_statement = 500 # -1 is disabled, 0 logs all statements
>>>>> log_checkpoints = on
>>>>> log_line_prefix = 'user=%u db=%d remote=%r ' # special values:
>>>>> log_lock_waits = on # log lock waits >= deadlock_timeout
>>>>> autovacuum = on # Enable autovacuum subprocess? 'on'
>>>>> log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
>>>>> autovacuum_max_workers = 5 # max number of autovacuum subprocesses
>>>>> datestyle = 'iso, mdy'
>>>>> lc_messages = 'en_US.UTF-8' # locale for system error message
>>>>> lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
>>>>> lc_numeric = 'en_US.UTF-8' # locale for number formatting
>>>>> lc_time = 'en_US.UTF-8' # locale for time formatting
>>>>> default_text_search_config = 'pg_catalog.english'
>>>>> deadlock_timeout = 300ms
>>>>>
>>>>>
>>>>> per pgtune:
>>>>> #------------------------------------------------------------------------------
>>>>> # pgtune wizard run on 2013-02-05
>>>>> # Based on 198333224 KB RAM in the server
>>>>> #------------------------------------------------------------------------------
>>>>> default_statistics_target = 100
>>>>> maintenance_work_mem = 1GB
>>>>> checkpoint_completion_target = 0.9
>>>>> effective_cache_size = 128GB
>>>>> work_mem = 1152MB
>>>>> wal_buffers = 8MB
>>>>> checkpoint_segments = 16
>>>>> shared_buffers = 44GB
>>>>> max_connections = 80
>>>>>
>>>>> We use pgbouncer (set to 140 connections) in transaction pooling mode
>>>>> in front of our db.
>>>>>
>>>>> The problem:
>>>>>
>>>>> For the most part, the server hums along. No other applications run
>>>>> on this server other than postgres. Load averages rarely break 2.0,
>>>>> it never swaps, and %iowait is usually not more than 0.12.
>>>>>
>>>>> But periodically, there are spikes in our app's db response time.
>>>>> Normally, the app's db response time hovers in the 100ms range for
>>>>> most of the day. During the spike times, it can go up to 1000ms or
>>>>> 1500ms, and the number of pg connections goes to 140 (maxed out at
>>>>> pgbouncer's limit, where normally it's only about 20-40 connections).
>>>>> Also, during these times, which usually last less than 2 minutes, we
>>>>> will see several thousand queries in the pg log (this is with
>>>>> log_min_duration_statement = 500), compared to maybe one or two dozen
>>>>> 500ms+ queries in non-spike times.
>>>>>
>>>>> In between spikes could be an hour, two hours, sometimes half a day.
>>>>> There doesn't appear to be any pattern that we can see:
>>>>> * there are no obvious queries that are locking the db
>>>>> * it doesn't necessarily happen during high-traffic times, though it
>>>>> can
>>>>> * it doesn't happen during any known system, db, or app
>>>>> regularly-scheduled job, including crons
>>>>> * in general, there's no discernible regularity to it at all
>>>>> * it doesn't coincide with checkpoint starts or completions
>>>>> * it doesn't coincide with autovacuums
>>>>> * there are no messages in any system logs that might indicate any
>>>>> system or hardware-related issue
>>>>>
>>>>> Besides spikes in our graphs, the only other visible effect is that
>>>>> %system in sar goes from an average of 0.7 to as high as 10.0 or so
>>>>> (%iowait and all other sar variables remain the same).
>>>>>
>>>>> And according to our monitoring system, web requests get queued up,
>>>>> and our alerting system sometimes either says there's a timeout or
>>>>> that it had multiple web response times greater than 300ms, so we
>>>>> suspect (but have no proof) that some users see either a long hang or
>>>>> possibly a timeout. But since it's almost always less than two
>>>>> minutes, and sometimes less than one, we don't really hear any
>>>>> complaints (guessing that most people hit reload, things work again,
>>>>> and they continue on), and we haven't been able to see any negative
>>>>> effect ourselves.
>>>>>
>>>>> But we want to get in front of the problem, in case it is something
>>>>> that will get worse as traffic continues to grow. We've tweaked
>>>>> various configs on the OS side as well as the postgresql.conf side.
>>>>> What's posted above is our current setup, and the problem persists.
>>>>>
>>>>> Any ideas as to where we could even look?
>>>>>
>>>>> Also, whether related or unrelated to the spikes, are there any
>>>>> recommendations for our postgresql.conf or sysctl.conf based on our
>>>>> hardware? From pgtune's output, I am lowering maintenance_work_mem
>>>>> from 24GB down to maybe 2GB, but I keep reading conflicting things
>>>>> about other settings, such as checkpoints or max_connections.
>>>>>
>>>>> johnny
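One thing that might help catch the next spike in the act: log the THP/compaction counters and the raw interrupt counter on a fixed interval, so they can be lined up against the latency graphs afterward. A rough sketch (the log file path and 10-second interval are just examples):

    while true; do
        # timestamp, THP/compaction counters, and total interrupts since boot
        (date; egrep 'trans|thp|compact_' /proc/vmstat; grep ^intr /proc/stat | cut -d' ' -f1-2) \
            >> /var/tmp/thp_watch.log
        sleep 10
    done

If compact_stall jumps at the same timestamps as the latency spikes, that would point fairly strongly at the compaction path Josh described.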