On 28/11/2012, at 17:54, Shaun Thomas <stho...@optionshouse.com> wrote:
> On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote:
>
>> https://rpm.newrelic.com/public/charts/h2dtedghfsv
>
> Doesn't this answer your question?
>
> That iowait is crushing your server into the ground. It's no surprise updates
> are taking several seconds. That update you sent us *should* execute on the
> order of only a few milliseconds.
>
> So I'll reiterate that you *must* move your pg_xlog location elsewhere.
> You've got row lookup bandwidth conflicting with writes. There are a couple
> of other changes you should probably make to your config:
>
>> checkpoint_segments = 16
>
> This is not enough for the workload you describe. Every time the database
> checkpoints, all of those changes in pg_xlog are applied to the backend data
> files. You should set these values:
>
> checkpoint_segments = 100
> checkpoint_timeout = 10min
> checkpoint_completion_target = 0.9
>
> This will reduce your overall write workload and spread the writes out more
> evenly. Too many checkpoints massively reduce write throughput. With the
> settings you have, it's probably checkpointing constantly while your load
> runs. Start with this, but experiment with increasing checkpoint_segments
> further.
>
> If you check your logs now, you probably see a ton of "checkpoint starting:
> xlog" in there. That's very bad. It should say "checkpoint starting: time",
> meaning it's keeping up with your writes naturally.
>
>> work_mem = 160MB
>
> This is probably way too high. work_mem is used for every sort operation in
> a query. So each connection could have several of these allocated, thus
> starving your system of memory, which will reduce the amount available for
> page cache. Change it to 8MB, and increase it in small increments if
> necessary.
>
>> So correct me if I'm wrong here: my theory is, that I have too many
>> too slow update queries, that then often end up in a situation, where
>> they "wait" for each other to finish, hence the sometimes VERY long
>> execution times.
>
> Sometimes this is the case, but in your case you're running into IO
> contention, not lock contention. Your 3TB RAID-1 is simply insufficient for
> this workload.
>
> If you check your logs after making the changes I've suggested, take a look
> at your checkpoint sync times. That will tell you how long it took the
> kernel to physically commit those blocks to disk and get a confirmation back
> from the controller. If those take longer than a second or two, you're
> probably running into controller buffer overflows. You have a large amount
> of RAM, so you should also make these two kernel changes to sysctl.conf:
>
> vm.dirty_ratio = 10
> vm.dirty_background_ratio = 1
>
> Then run this:
>
> sysctl -p
>
> This will help prevent the large IO write spikes caused when the kernel
> decides to write out dirty memory. Those can make checkpoints take minutes
> to commit in some cases, which basically stops all write traffic to your
> database entirely.
>
> That should get you going, anyway. You still need more/better disks so you
> can move your pg_xlog directory. With your write load, that will make a huge
> difference.
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
> 312-444-8534
> stho...@optionshouse.com

Okay, I'm now done with the updates described above. I made the postgresql.conf changes and the kernel changes, and I added two SSDs in a software RAID1, which is where pg_xlog is now located.
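For reference, here is roughly what the configuration now looks like (a sketch of the values Shaun suggested; I have also turned on log_checkpoints, which is off by default, so that the "checkpoint starting:" lines actually show up in the log):

    # postgresql.conf
    checkpoint_segments = 100
    checkpoint_timeout = 10min
    checkpoint_completion_target = 0.9
    work_mem = 8MB
    log_checkpoints = on    # not in the original suggestion; makes checkpoint activity visible

    # /etc/sysctl.conf (applied with "sysctl -p")
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 1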
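The pg_xlog move itself went roughly like this (a sketch only; /dev/sdb, /dev/sdc, the /ssd mount point and the Debian-style data directory path are placeholders for my actual setup, and the server has to be stopped while the directory is moved):

    # build the software RAID1 from the two SSDs and mount it
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 /dev/md0
    mount /dev/md0 /ssd

    # relocate pg_xlog and leave a symlink in its place
    service postgresql stop
    mv /var/lib/postgresql/9.2/main/pg_xlog /ssd/pg_xlog
    ln -s /ssd/pg_xlog /var/lib/postgresql/9.2/main/pg_xlog
    service postgresql start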
Unfortunately, the picture is still the same :-( When the database is under "heavy" load there is almost no improvement in performance compared to before the changes. A lot of both reads and writes take more than 1,000 times as long as they usually do under "lighter" overall load. I have added all the overview charts I could get hold of from New Relic below. What am I overlooking? There must be an obvious bottleneck somewhere; where should I dive in? (A sketch of the checks I plan to run next follows the chart links.)

Database server CPU usage
https://rpm.newrelic.com/public/charts/cEdIvvoQZCr

Database server load average
https://rpm.newrelic.com/public/charts/cMNdrYW51QJ

Database server physical memory
https://rpm.newrelic.com/public/charts/c3dZBntNpa1

Database server disk I/O utilization
https://rpm.newrelic.com/public/charts/9YEVw6RekFG

Database server network I/O (Mb/s)
https://rpm.newrelic.com/public/charts/lKiZ0Szmwe7

Top 5 database operations by wall clock time
https://rpm.newrelic.com/public/charts/dCt45YH12FK

Database throughput
https://rpm.newrelic.com/public/charts/bIbtQ1mDzMI

Database response time
https://rpm.newrelic.com/public/charts/fPcNL8WA6xx
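As promised, here is what I plan to run during the next load spike (a sketch; pg_stat_bgwriter and iostat are standard tools, and the 5-second interval is arbitrary):

    -- in psql: are checkpoints still being forced by xlog pressure
    -- (i.e. checkpoints_req climbing much faster than checkpoints_timed)?
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_backend
    FROM pg_stat_bgwriter;

    # at the shell: per-device utilization and wait times, sampled every 5 seconds
    iostat -x 5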