Hello Mark, if I were you I would start by running some basic I/O benchmarks (bonnie++, maybe?) and checking my sysctl.conf. Your PostgreSQL parameters look OK to me. When these delays occur, have you noticed what is causing them? Output from vmstat captured while the delays are happening would also help.
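Something along these lines is what I had in mind; the sample counts, output path, bonnie++ test directory, and size are illustrative choices, not recommendations:

```shell
# Capture one-second vmstat samples while a stall is in progress
# (increase the count so the window covers a whole stall). Watch the
# r/b columns (run queue vs. uninterruptible sleep), si/so (swapping),
# and wa (I/O wait) to see where the time is going.
vmstat 1 5 > /tmp/vmstat_during_stall.txt

# Baseline the disk subsystem with bonnie++ if it is installed.
# -d: test directory, -s: file size (roughly 2x RAM; adjust to your box),
# -u: user to run as. Run this on each volume (pg_xlog, data, indexes).
command -v bonnie++ >/dev/null \
    && bonnie++ -d /var/lib/pgsql/bench -s 192g -u postgres \
    || echo "bonnie++ not installed"
```

Comparing the bonnie++ numbers under SP1 and SP2 kernels would tell you quickly whether the regression is in the I/O path or elsewhere.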
Vasilis Ventirozos

On Thu, Feb 21, 2013 at 11:59 AM, Mark Smith <smithmark...@gmail.com> wrote:
> Hardware: IBM X3650 M3 (2 x Xeon X5680 6C 3.33GHz), 96GB RAM. IBM X3524
> with RAID 10 ext4 (noatime,nodiratime,data=writeback,barrier=0) volumes
> for pg_xlog / data / indexes.
>
> Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.
> max_connections = 1500
> shared_buffers = 16GB
> work_mem = 64MB
> maintenance_work_mem = 256MB
> wal_level = archive
> synchronous_commit = off
> wal_buffers = 16MB
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> effective_cache_size = 32GB
>
> Workload: OLTP, typically with 500+ concurrent database connections; the
> same Linux instance is also used as web server and application server.
> Far from ideal, but it has worked well for 15 months.
>
> Problem: We have been running PostgreSQL 9.0.4 on SLES11 SP1, the last
> kernel in use was 2.6.32-43-0.4, and performance has always been great.
> Since updating from SLES11 SP1 to SP2 we now experience many database
> 'stalls' (e.g. normally 'instant' queries taking many seconds; any query
> will be slow, even just connecting to the database). We have trialled
> PostgreSQL 9.2.3 under SLES11 SP2 with exactly the same results. During
> these periods the machine is completely responsive, but anything
> accessing the database is extremely slow.
>
> I have tried increasing sched_migration_cost from 500000 to 5000000 and
> also tried setting sched_compat_yield to 1; neither of these appeared to
> make a difference. I don't have the parameter 'sched_autogroup_enabled'.
> Nothing jumps out from top/iostat/sar/pg_stat_activity, however I am
> very far from expert in interpreting their output.
>
> We have work underway to reduce our number of connections, as although
> it has always worked OK, perhaps it makes us particularly vulnerable to
> kernel/scheduler changes.
>
> I would be very grateful for any suggestions as to the best way to
> diagnose the source of this problem and/or general recommendations?