Our production database is seeing very heavy CPU utilization - does anyone have any ideas/input given the following?

CPU utilization gradually increases during the day until it approaches 90-100% at our peak time. When this happens our transactions/sec drops and the site becomes very slow. In this state I can see hundreds of queries in pg_stat_activity that are not waiting on locks but sit there for minutes. When the database is not in this state, those same queries complete in fractions of a second - faster than my script that watches pg_stat_activity can keep track of them.
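For concreteness, a query along these lines (a minimal sketch against 8.3's pg_stat_activity columns; the one-minute cutoff is arbitrary) shows the stuck-but-not-waiting backends I am describing:

    SELECT procpid,
           now() - query_start AS runtime,
           current_query
    FROM pg_stat_activity
    WHERE current_query NOT LIKE '<IDLE>%'          -- skip idle and idle-in-transaction backends
      AND NOT waiting                               -- not blocked on a lock
      AND now() - query_start > interval '1 minute' -- has been running "for minutes"
    ORDER BY runtime DESC;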

This server has dual quad-core Xeon 5310s, 32 GB RAM, and a few different disk arrays (all managed in hardware by either the PERC 5/i or PERC 5/e adapter). The PostgreSQL data is on a 14-disk 7.2k SATA RAID 10. This server runs nothing but PostgreSQL.

The database (according to pg_database_size) is 55 GB, and we are running PostgreSQL 8.3.5 on the 2.6.28.7-2 kernel under Arch Linux.

Right now (not under peak load) this server is running at 68% CPU utilization and its SATA RAID 10 is doing about 2 MB/s of writes and 11 MB/s of reads. When I run dd I can hit 200+ MB/s writes and 230+ MB/s reads, so we are barely using the available I/O. Furthermore, while dd runs, that process's CPU utilization only reaches 20-30% of one core.
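For reference, the dd numbers above come from simple sequential tests of roughly this shape (the path and sizes here are illustrative; the test file must exceed the 32 GB of RAM so the read is not served from the page cache):

    # sequential write, flushed to disk before dd reports its rate (~38 GB file)
    dd if=/dev/zero of=/data/ddtest bs=8k count=5000000 conv=fdatasync

    # sequential read of the same file back
    dd if=/data/ddtest of=/dev/null bs=8k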

Additionally, when I watch "top -c" I generally see a dozen or so "idle" postgres processes (they appear and drop away quickly, though) consuming very large chunks of CPU (as much as 60% of a core). At any given time we have hundreds of idle postgres processes due to the JDBC connection pooling, but most of them sit at 0% as I would expect. I also see SELECTs and INSERTs consuming very large percentages of CPU, but I find that less strange since they are actually doing work.
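For anyone who wants the same breakdown from the database side: on 8.3, idle backends show up in pg_stat_activity with current_query set to '<IDLE>', so a rough count by state looks like this:

    SELECT CASE
             WHEN current_query = '<IDLE>'                THEN 'idle'
             WHEN current_query = '<IDLE> in transaction' THEN 'idle in transaction'
             ELSE 'active'
           END AS state,
           count(*)
    FROM pg_stat_activity
    GROUP BY 1
    ORDER BY 2 DESC;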

Any ideas as to what is causing our CPUs to struggle? Could the fact that our RAM holds a significant portion of the database be causing the CPUs to thrash as they work against memory while the disk controllers sit idle? According to top we barely use any swap.

We currently have max_connections set to 1000 (roughly the sum of the JDBC pools on our application servers). Would decreasing this value help? We can decrease the JDBC pools or switch to pgbouncer for pooling if this is the case.
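For what it's worth, a simple way to compare the live backend count against that limit (nothing 8.3-specific here):

    SELECT count(*)                           AS backends,
           current_setting('max_connections') AS max_connections
    FROM pg_stat_activity;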

Really just looking for any input/ideas. Our workload is primarily OLTP in nature - essentially a social network. The transactions/sec figure I mentioned at the start is the xact_commit value from pg_stat_database; please let me know if that is not an appropriate way to estimate tps. Right now, at the ~68% CPU utilization and low I/O use described above, xact_commit is increasing at a rate of about 1070 per second.
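To be precise about the method: I sample the counter along these lines, N seconds apart, and divide the delta by N:

    -- run twice, N seconds apart; (second - first) / N = commits/sec
    SELECT xact_commit
    FROM pg_stat_database
    WHERE datname = current_database();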

I have an identical PITR slave; I can pause its PITR sync to run any tests against it. I will happily provide any additional information that would be helpful.

Any assistance is greatly appreciated.

Joe Uhl
