On 28/11/2012 at 17.54, Shaun Thomas <stho...@optionshouse.com> wrote:

> On 11/28/2012 10:19 AM, Niels Kristian Schjødt wrote:
> 
>> https://rpm.newrelic.com/public/charts/h2dtedghfsv
> 
> Doesn't this answer your question?
> 
> That iowait is crushing your server into the ground. It's no surprise updates 
> are taking several seconds. That update you sent us *should* execute on the 
> order of only a few milliseconds.
> 
> So I'll reiterate that you *must* move your pg_xlog location elsewhere. 
> You've got row lookup bandwidth conflicting with writes. There are a couple 
> other changes you should probably make to your config:
> 
>> checkpoint_segments = 16
> 
> This is not enough for the workload you describe. Every time the database 
> checkpoints, all of the dirty data accumulated since the last checkpoint has 
> to be flushed out to the data files. You should set these values:
> 
> checkpoint_segments = 100
> checkpoint_timeout = 10m
> checkpoint_completion_target = 0.9
> 
> This will reduce how often you checkpoint and spread each checkpoint's writes 
> over a longer window, which lowers your overall write load. Checkpointing too 
> frequently massively reduces write throughput. With the settings you have, 
> it's probably checkpointing almost constantly while your load runs. Start with 
> this, but experiment with increasing checkpoint_segments further.
> 
> If you check your logs now, you probably see a ton of "checkpoint starting: 
> xlog" in there. That's very bad. It should say "checkpoint starting: time" 
> meaning it's keeping up with your writes naturally.
> 
>> work_mem = 160MB
> 
> This is probably way too high. work_mem is allocated for every sort (and hash) 
> operation in a query, so each connection could have several of these allocated 
> at once, starving your system of memory and reducing what's available for the 
> page cache. Change it to 8MB, and increase it in small increments if necessary.
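> 
> If an individual query genuinely needs a big sort area, you can keep the 
> global value low and raise it only for that session. A sketch (the 64MB 
> figure is just an example):
> 
> # keep the global value low in postgresql.conf:
> work_mem = 8MB
> 
> # then, inside psql, raise it just before the one query that needs it:
> SET work_mem = '64MB';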
> 
>> So correct me if I'm wrong here: my theory is, that I have too many
>> too slow update queries, that then often end up in a situation, where
>> they "wait" for each other to finish, hence the sometimes VERY long
>> execution times.
> 
> Sometimes that is the case, but here you're running into IO contention, not 
> lock contention. Your 3TB RAID-1 is simply insufficient for this workload.
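> 
> If you want to rule out lock contention explicitly, one quick sanity check 
> (assuming psql access) while the slow updates are running:
> 
> # if this stays near zero while updates crawl, it's IO, not locks:
> psql -c "SELECT count(*) FROM pg_locks WHERE NOT granted;"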
> 
> If you check your logs after making the changes I've suggested, take a look 
> at your checkpoint sync times. That will tell you how long it took the kernel 
> to physically commit those blocks to disk and get a confirmation back from 
> the controller. If those take longer than a second or two, you're probably 
> running into controller buffer overflows. You have a large amount of RAM, so 
> you should also make these two kernel changes to sysctl.conf:
> 
> vm.dirty_ratio = 10
> vm.dirty_background_ratio = 1
> 
> Then run this:
> 
> sysctl -p
> 
> This will help prevent large IO write spikes caused when the kernel decides 
> to write out dirty memory. That can make checkpoints take minutes to commit 
> in some cases, which basically stops all write traffic to your database 
> entirely.
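> 
> To see whether that is actually happening on your box, watch how much dirty 
> memory the kernel is holding while your load runs, and confirm the new 
> settings took effect. A rough sketch:
> 
> # dirty/writeback memory, refreshed every 5 seconds:
> watch -n 5 "grep -E 'Dirty|Writeback' /proc/meminfo"
> 
> # confirm the sysctl values are live:
> sysctl vm.dirty_ratio vm.dirty_background_ratio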
> 
> That should get you going, anyway. You still need more/better disks so you 
> can move your pg_xlog directory. With your write load, that will make a huge 
> difference.
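> 
> The move itself is mechanically simple once the disks are there. A rough 
> sketch, assuming PostgreSQL is stopped, the new volume is already mounted at 
> /ssd, and the data directory is at the Debian default location (both paths 
> are only examples):
> 
> mv /var/lib/postgresql/9.2/main/pg_xlog /ssd/pg_xlog
> ln -s /ssd/pg_xlog /var/lib/postgresql/9.2/main/pg_xlog
> chown -R postgres:postgres /ssd/pg_xlog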
> 
> -- 
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
> 312-444-8534
> stho...@optionshouse.com
> 

Okay, I'm now done with the updates described above. I made the postgresql.conf 
changes and the kernel changes, and I added two SSDs in a software RAID1 where 
pg_xlog is now located - unfortunately the picture is still the same :-( 
When the database is under "heavy" load, there is almost no improvement in 
performance compared to before the changes. A lot of both reads and writes take 
more than 1000 times as long as they usually do under "lighter" overall load.

I've added all the overview charts I can get hold of from New Relic below. What 
am I overlooking? There must be an obvious bottleneck somewhere. Where should I 
dive in?
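
Would per-device disk stats help narrow it down? I can watch something like 
this while the load is on (plain iostat from sysstat, nothing fancy), to see 
whether it's the SSD pair or the big RAID-1 that saturates:

# per-device view, refreshed every 5 seconds:
iostat -x 5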

Database server CPU usage
https://rpm.newrelic.com/public/charts/cEdIvvoQZCr

Database server load average
https://rpm.newrelic.com/public/charts/cMNdrYW51QJ

Database server physical memory
https://rpm.newrelic.com/public/charts/c3dZBntNpa1

Database server disk I/O utilization
https://rpm.newrelic.com/public/charts/9YEVw6RekFG

Database server network I/O (Mb/s)
https://rpm.newrelic.com/public/charts/lKiZ0Szmwe7

Top 5 database operations by wall clock time
https://rpm.newrelic.com/public/charts/dCt45YH12FK

Database throughput
https://rpm.newrelic.com/public/charts/bIbtQ1mDzMI

Database response time
https://rpm.newrelic.com/public/charts/fPcNL8WA6xx
