On Mon, Oct 14, 2013 at 4:42 PM, Tomas Vondra <t...@fuzzy.cz> wrote:

> On 15.10.2013 01:00, Tony Kay wrote:
> > Hi,
> >
> > I'm running 9.1.6 w/22GB shared buffers, and 32GB overall RAM on a
> > 16 Opteron 6276 CPU box. We limit connections to roughly 120, but
> > our webapp is configured to allocate a thread-local connection, so
> > those connections are rarely doing anything more than half the time.
>
> Lower your shared buffers to about 20% of your RAM, unless you've tested
> it's actually helping in your particular case. It's unlikely you'll get
> better performance by using more than that, especially on older
> versions, so it's wiser to leave the rest for page cache.
>
> It might even be one of the causes of the performance issue you're
> seeing, as shared buffers are not exactly overhead-free.
>
> See this for more details on tuning:
>
> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
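For reference, here is how I will double-check what the box is actually
running with before touching anything (a minimal sketch against the standard
catalogs; note that on 9.1 shared_buffers can only be changed by editing
postgresql.conf and restarting, since ALTER SYSTEM does not exist until 9.4):

  -- sketch: inspect the current memory-related settings
  show shared_buffers;

  select name, setting, unit, source
    from pg_settings
   where name in ('shared_buffers', 'effective_cache_size', 'work_mem');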
I had followed the general directions from several sources years ago, which
indicated up to 40% of RAM. We've been running very large shared buffers for
4 years now, but it is difficult to generate a realistic load without testing
against users, so we have not felt the need to move it around. In general, I
don't tend to tinker with a setting that has been fine for this long without
good reason. I've been wanting to upgrade to the newer mmap-based versions of
pgsql, but was waiting to re-tune this when I did so. Why do you suspect that
shared_buffers would cause the behavior I'm seeing?

> You're on a rather old 9.1.x version, BTW. The last version in this
> branch is 9.1.10 and there are some important security fixes (e.g. in
> 9.1.9). Not sure if there are any fixes relevant to the performance
> issue, though.

An upgrade to 9.1.10 is planned.

> A few initial questions:
>
> * What OS are we dealing with?

CentOS 6 (el6).

> * So how many active connections are there on average (see
>   pg_stat_activity for connections running queries)?

About 40-60.

> * How much data are we talking about? In total and in the imports?

An 80GB database. The imports are maybe 1-3 MB...often much smaller. 10k rows
would probably be an average.

> > We have been running smoothly for over a year on this configuration,
> > and recently started having huge CPU spikes that bring the system to
> > its knees. Given that it is a multiuser system, it has been quite
> > hard to pinpoint the exact cause, but I think we've narrowed it down
> > to two data import jobs that were running in semi-long transactions
> > (clusters of row inserts).
> >
> > The tables affected by these inserts are used in common queries.
> >
> > The imports will bring in a row count of perhaps 10k on average
> > covering 4 tables.
> >
> > The insert transactions are at isolation level read committed (the
> > default for the JDBC driver).
> >
> > When the import would run (again, theory...we have not been able to
> > reproduce), we would end up maxed out on CPU, with a load average of
> > 50 for 16 CPUs (our normal busy usage is a load average of 5 out of
> > 16 CPUs).
> >
> > When looking at the active queries, most of them are against the
> > tables that are affected by these imports.
>
> Which processes consume most CPU time? Are those backends executing the
> queries, or some background processes (checkpointer, autovacuum, ...)?

The backends executing the queries...most of the queries that seem hung
usually run in a few ms.

> Can you post a "top -c" output collected at the time of the CPU peak?

We don't have process accounting, so I cannot regenerate that; however, I can
tell you what queries were active during one of the spikes. There were 36
queries against table ind_event (which is one affected by the import). Those
queries usually take 5-10ms, and we never see more than 2 of them active
during normal operation.
These had been active for _minutes_....a sample of the running queries:

   time_active   | datname | procpid |                   query
-----------------+---------+---------+---------------------------------------------
 00:08:10.891105 | tudb    |    9058 | select * from mr_uss_ind_event_x where (tu
 00:08:10.981845 | tudb    |    8977 | select * from mr_uss_ind_event_x where (tu
 00:07:08.883347 | tudb    |    8930 | select * from mr_uss_ind_event_x where org
 00:07:15.266393 | tudb    |    8927 | select * from mr_uss_ind_event_x where org
 00:07:27.587133 | tudb    |   11867 | update msg_result set dt_result=$1,msg_id=
 00:08:06.458885 | tudb    |    8912 | select * from mr_uss_ind_event_x where org
 00:06:43.036487 | tudb    |    8887 | select * from mr_uss_ind_event_x where (tu
 00:07:01.992367 | tudb    |    8831 | select * from mr_uss_ind_event_x where (tu
 00:06:59.217721 | tudb    |    8816 | select * from mr_uss_ind_event_x where org
 00:07:07.558848 | tudb    |    8811 | update md_invoice set unbilled_amt=unbille
 00:07:30.636192 | tudb    |    8055 | select * from mr_uss_ind_event_x where (tu
 00:07:26.761053 | tudb    |    8053 | update msg_result set dt_result=$1,msg_id=
 00:06:46.021084 | tudb    |    8793 | select * from mr_uss_ind_event_x where (tu
 00:07:26.412781 | tudb    |    8041 | select * from mr_uss_ind_event_x where org
 00:07:43.315019 | tudb    |    8031 | select * from mr_uss_ind_event_x where org
 00:07:42.651143 | tudb    |    7990 | select * from mr_uss_ind_event_x where org
 00:06:45.258232 | tudb    |    7973 | select * from mr_uss_ind_event_x where (tu
 00:07:46.135027 | tudb    |    7961 | select * from mr_uss_ind_event_x where (tu
 00:07:31.814513 | tudb    |    7959 | select * from mr_uss_ind_event_x where (tu
 00:07:27.739959 | tudb    |    8221 | select * from mr_uss_ind_event_x where org
 00:07:21.554369 | tudb    |    8191 | select * from mr_uss_ind_event_x where org
 00:07:30.665133 | tudb    |    7953 | select * from mr_uss_ind_event_x where org
 00:07:17.727028 | tudb    |    7950 | select * from mr_uss_ind_event_x where org
 00:07:25.657611 | tudb    |    7948 | select * from mr_uss_ind_event_x where org
 00:07:28.118856 | tudb    |    7939 | select * from mr_uss_ind_event_x where org
 00:07:32.436705 | tudb    |    7874 | insert into mr_uss_ind_event (prelimtime_c
 00:08:12.090187 | tudb    |    7873 | select * from mr_uss_ind_event_x where (tu
 00:07:19.181981 | tudb    |    7914 | select * from mr_uss_ind_event_x where (tu
 00:07:04.234119 | tudb    |    7909 | select * from mr_uss_ind_event_x where (tu
 00:06:52.614609 | tudb    |    7856 | select * from mr_uss_ind_event_x where org
 00:07:18.667903 | tudb    |    7908 | select * from mr_uss_ind_event_x where (tu

The insert listed there is coming from that import...the others are querying
a view that includes that table in a join.

> Also, try to collect a few snapshots of pg_stat_bgwriter catalog before
> and during the loads. Don't forget to include the timestamp:
>
>   select now(), * from pg_stat_bgwriter;

This is a live production system, and it will take me some doing to generate
a load on a test server that triggers the condition. I'll be certain to
gather this and the other stats if I can trigger it.

> and when you're at it, pg_stat_database snapshots might be handy too
> (not sure if you're running a single database or multiple ones), so use
> either
>
>   select now(), * from pg_stat_database;
>
> or
>
>   select now(), * from pg_stat_database where datname = '..dbname..';
>
> That should give us at least some insight into what's happening.
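In the meantime, here is the snapshot script I intend to leave running in a
loop on the production box so we catch the next spike (a sketch only; it
assumes the single database is named tudb, as in the listing above, and it
uses the 9.1 column names procpid/current_query, which become pid/query in
9.2):

  -- timestamped snapshots, collected every ~30s around an import
  select now(), * from pg_stat_bgwriter;
  select now(), * from pg_stat_database where datname = 'tudb';

  -- statements that have been running for more than a second
  select now() - query_start as time_active, datname, procpid, current_query
    from pg_stat_activity
   where current_query not like '<IDLE>%'
     and now() - query_start > interval '1 second'
   order by time_active desc;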
> > Our workaround (that is holding at present) was to drop the
> > transactions on those imports (which is not optimal, but fortunately
> > is acceptable for this particular data). This workaround has
> > prevented any further incidents, but is of course inconclusive.
> >
> > Does this sound familiar to anyone, and if so, please advise.
>
> I'm wondering how this could be related to the transactions, and IIRC
> the stats (e.g. # of inserted rows) are sent at commit time. That might
> trigger the autovacuum. But without the transactions the autovacuum
> would be triggered sooner ...
>
> regards
> Tomas
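The autovacuum angle is worth checking. Next time a spike hits I will also
grab the vacuum/analyze counters for the affected tables; a sketch of what I
have in mind (the table names are taken from the listing above, everything
else is just the standard statistics view):

  -- sketch: did autovacuum/autoanalyze run on the import tables recently?
  select now(), relname, n_live_tup, n_dead_tup,
         last_autovacuum, last_autoanalyze,
         autovacuum_count, autoanalyze_count
    from pg_stat_user_tables
   where relname in ('mr_uss_ind_event', 'msg_result', 'md_invoice')
   order by relname;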