Hi Ken,
On Tue, Sep 10, 2013 at 11:33 AM, k...@rice.edu <k...@rice.edu> wrote:

> On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
> > Hi All,
> >
> > I've been seeing a strange issue with our Postgres install for about a year
> > now, and I was hoping someone might be able to help point me at the cause.
> > At what seem like fairly random intervals Postgres will become unresponsive
> > to the 3 application nodes it services. These periods tend to last for
> > 10-15 minutes before everything rights itself and the system goes back to
> > normal.
> >
> > During these periods the server will report a spike in outbound bandwidth
> > (from about 1mbs to about 5mbs most recently), a huge spike in context
> > switches / interrupts (normal peaks are around 2k/8k respectively, and
> > during these periods they've gone to 15k/22k), and a load average of 100+.
> > CPU usage stays relatively low, but it's all system time reported; user
> > time goes to zero. It doesn't seem to be disk related since we're running
> > with a shared_buffers setting of 24G, which will fit just about our entire
> > database into memory, and the IO transactions reported by the server, as
> > well as the disk reads reported by Postgres, stay consistently low.
> >
> > We've recently started tracking how long statements take to execute, and
> > we're seeing some really odd numbers. A simple delete by primary key, for
> > example, from a table that contains about 280,000 rows, reportedly took
> > 18h59m46.900s. An update by primary key in that same table was reported as
> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> > numbers don't seem reasonable at all.
> >
> > Some other changes we've made to postgresql.conf:
> >
> > synchronous_commit = off
> > maintenance_work_mem = 1GB
> > wal_level = hot_standby
> > wal_buffers = 16MB
> > max_wal_senders = 10
> > wal_keep_segments = 5000
> > checkpoint_segments = 128
> > checkpoint_timeout = 30min
> > checkpoint_completion_target = 0.9
> > max_connections = 500
> >
> > The server is a Dell PowerEdge R900 with 4 Xeon E7430 processors, 48GB of
> > RAM, running CentOS 6.3.
> >
> > So far we've tried disabling Transparent Huge Pages after I found a number
> > of resources online that indicated similar interrupt/context switch issues,
> > but it hasn't resolved the problem. I managed to catch it happening once
> > and run a perf which showed:
> >
> > +  41.40%  48154  postmaster  0x347ba9   f  0x347ba9
> > +   9.55%  10956  postmaster  0x2dc820   f  set_config_option
> > +   8.64%   9946  postmaster  0x5a3d4    f  writeListPage
> > +   5.75%   6609  postmaster  0x5a2b0    f  ginHeapTupleFastCollect
> > +   2.68%   3084  postmaster  0x192483   f  build_implied_join_equality
> > +   2.61%   2990  postmaster  0x187a55   f  build_paths_for_OR
> > +   1.86%   2131  postmaster  0x794aa    f  get_collation_oid
> > +   1.56%   1822  postmaster  0x5a67e    f  ginHeapTupleFastInsert
> > +   1.53%   1766  postmaster  0x1929bc   f  distribute_qual_to_rels
> > +   1.33%   1558  postmaster  0x249671   f  cmp_numerics
> >
> > I'm not sure what 0x347ba9 represents, or why it's an address rather than
> > a method name.
> >
> > That's about the sum of it. Any help would be greatly appreciated and if
> > you want any more information about our setup, please feel free to ask.
> >
> > Thanks,
> > Dave
>
> Hi Dave,
>
> A load average of 100+ means that you have that many processes waiting to
> run yet you only have 16 cpus.
> You really need to consider using a connection pooler like pgbouncer to
> keep your connection count in the 16-32 range.

That would make sense if the issues corresponded to increased load, but they
don't. I understand that the load spike is caused by waiting processes, but it
doesn't seem to correspond to a spike in transactions. The number of
transactions per second appears to stay in line with normal usage when these
issues occur. I do see an increase in postmaster processes when it happens,
but they don't seem to have entered a transaction yet. Coupled with the fact
that CPU usage is all system time, and with the context switch / interrupt
spikes, I feel like something must be going on behind the scenes that's
leading to these problems. I'm just not sure what that something is.

> Regards,
> Ken
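
P.S. On the 0x347ba9 frame: as far as I can tell, perf falls back to printing
a raw address when it has no symbol information for that spot in the binary,
so the next time this happens I'm going to try pulling in debug symbols for
our Postgres build and re-running the profile. Roughly this (the package name
is a guess; it depends on how the server was installed):

    # CentOS 6: debuginfo-install comes with yum-utils
    yum install yum-utils
    # debug symbols for the postgres binary; "postgresql-server" is only a
    # guess at the package name for our install
    debuginfo-install postgresql-server

    # profile the whole box for 30s while the problem is happening; with
    # symbols available, the 0x347ba9 entry should show up as a function name
    perf record -a -g -- sleep 30
    perf report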
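
In case it's relevant, the THP change was applied roughly like this. On our
CentOS 6 kernel the knob lives under the redhat_* path; newer kernels expose
it as /sys/kernel/mm/transparent_hugepage instead:

    # turn off transparent huge pages and THP defrag (CentOS 6 sysfs path)
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag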