On Oct 9, 2012, at 1:45 AM, Craig James <cja...@emolecules.com> wrote:

> This is driving me crazy.  A new server, virtually identical to an old one, 
> has 50% of the performance with pgbench.  I've checked everything I can think 
> of.
> 
> The setups (call the servers "old" and "new"):
> 
> old: 2 x 4-core Intel Xeon E5620
> new: 4 x 4-core Intel Xeon E5606
> 
> both:
> 
>   memory: 12 GB DDR ECC
>   Disks: 12x500GB disks (Western Digital 7200RPM SATA)
>     2 disks, RAID1: OS (ext4) and postgres xlog (ext2)
>     8 disks, RAID10: $PGDATA
> 
>   3WARE 9650SE-12ML with battery-backed cache.  The admin tool (tw_cli)
>   indicates that the battery is charged and the cache is working on both 
> units.
> 
>   Linux: 2.6.32-41-server #94-Ubuntu SMP (new server's disk was
>   actually cloned from old server).
> 
>   Postgres: 8.4.4 (yes, I should update.  But both are identical.)
> 
> The postgresql.conf files are identical; the diffs from the default configuration are:
> 
>     max_connections = 500
>     shared_buffers = 1000MB
>     work_mem = 128MB
>     synchronous_commit = off
>     full_page_writes = off
>     wal_buffers = 256kB
>     checkpoint_segments = 30
>     effective_cache_size = 4GB
>     track_activities = on
>     track_counts = on
>     track_functions = none
>     autovacuum = on
>     autovacuum_naptime = 5min
>     escape_string_warning = off
> 
> Note that the old server is in production and was serving a light load while 
> this test was running, so in theory it should be slower, not faster, than the 
> new server. 
> 
> pgbench: Old server
> 
>     pgbench -i -s 100 -U test
>     pgbench -U test -c ... -t ...
> 
>     -c  -t      TPS
>      5  20000  3777
>     10  10000  2622
>     20  5000   3759
>     30  3333   5712
>     40  2500   5953
>     50  2000   6141
> 
> New server
>     -c  -t      TPS
>     5   20000  2733
>     10  10000  2783
>     20  5000   3241
>     30  3333   2987
>     40  2500   2739
>     50  2000   2119

On the new server, PostgreSQL does not scale at all. It looks like contention.
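
If it is contention, sampling the lock views while pgbench is running should show it. A rough sketch (adjust the database and connection options as needed; both the pg_stat_activity "waiting" column and pg_locks "granted" column exist in 8.4):

    # count backends currently blocked waiting on a lock
    psql -U test -c "SELECT count(*) FROM pg_stat_activity WHERE waiting;"

    # break down any ungranted locks by type and mode
    psql -U test -c "SELECT locktype, mode, count(*) FROM pg_locks WHERE NOT granted GROUP BY locktype, mode;"

Run those a few times during the -c 40 / -c 50 runs; if the counts stay near zero, the bottleneck is probably elsewhere.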

> 
> As you can see, the new server is dramatically slower than the old one.
> 
> I tested both the RAID10 data disk and the RAID1 xlog disk with bonnie++.  
> The xlog disks were almost identical in performance.  The RAID10 pg-data 
> disks looked like this:
> 
> Old server:
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> xenon        24064M   687  99 203098  26 81904  16  3889  96 403747  31 737.6  31
> Latency             20512us     469ms     394ms   21402us     396ms     112ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> xenon               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 15953  27 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> Latency             43291us     857us     519us    1588us      37us     178us
> 1.96,1.96,xenon,1,1349726125,24064M,,687,99,203098,26,81904,16,3889,96,403747,31,737.6,31,16,,,,,15953,27,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,20512us,469ms,394ms,21402us,396ms,112ms,43291us,857us,519us,1588us,37us,178us
> 
> 
> New server:
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> zinc         24064M   862  99 212143  54 96008  14  4921  99 279239  17 752.0  23
> Latency             15613us     598ms     597ms    2764us     398ms     215ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> zinc                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 20380  26 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> Latency               487us     627us     407us     972us      29us     262us
> 1.96,1.96,zinc,1,1349722017,24064M,,862,99,212143,54,96008,14,4921,99,279239,17,752.0,23,16,,,,,20380,26,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,15613us,598ms,597ms,2764us,398ms,215ms,487us,627us,407us,972us,29us,262us
> 
> I don't know enough about bonnie++ to know if these differences are 
> interesting.
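
One number that does stand out: sequential block input is 403747 K/sec on the old array versus 279239 K/sec on the new one. A quick direct-I/O read straight off the data array would confirm whether raw read throughput really differs (just a sketch; substitute the actual RAID10 device node for /dev/sdX):

    # read 4 GB from the data array, bypassing the page cache
    dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct
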
> 
> I noted one dramatic difference via vmstat.  On the old server, the I/O load
> during the bonnie++ run was steady, like this:
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  2  71800 2117612  17940 9375660    0    0 82948 81944 1992 1341  1  3 86 10
>  0  2  71800 2113328  17948 9383896    0    0 76288 75806 1751 1167  0  2 86 11
>  0  1  71800 2111004  17948 9386540   92    0 93324 94232 2230 1510  0  4 86 10
>  0  1  71800 2106796  17948 9387436  114    0 67698 67588 1572 1088  0  2 87 11
>  0  1  71800 2106724  17956 9387968   50    0 81970 85710 1918 1287  0  3 86 10
>  1  1  71800 2103304  17956 9390700    0    0 92096 92160 1970 1194  0  4 86 10
>  0  2  71800 2103196  17976 9389204    0    0 70722 69680 1655 1116  1  3 86 10
>  1  1  71800 2099064  17980 9390824    0    0 57346 57348 1357  949  0  2 87 11
>  0  1  71800 2095596  17980 9392720    0    0 57344 57348 1379  987  0  2 86 12
> 
> But the new server varied wildly during bonnie++:
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  1      0 4518352  12004 7167000    0    0 118894 120838 2613 1539  0  2 93  5
>  0  1      0 4517252  12004 7167824    0    0  52116  53248 1179  793  0  1 94  5
>  0  1      0 4515864  12004 7169088    0    0  46764  49152 1104  733  0  1 91  7
>  0  1      0 4515180  12012 7169764    0    0  32924  30724  750  542  0  1 93  6
>  0  1      0 4514328  12016 7170780    0    0  42188  45056 1019  664  0  1 90  9
>  0  1      0 4513072  12016 7171856    0    0  67528  65540 1487  993  0  1 96  4
>  0  1      0 4510852  12016 7173160    0    0  56876  57344 1358  942  0  1 94  5
>  0  1      0 4500280  12044 7179924    0    0  91564  94220 2505 2504  1  2 91  6
>  0  1      0 4495564  12052 7183492    0    0 102660 104452 2289 1473  0  2 92  6
>  0  1      0 4492092  12052 7187720    0    0  98498  96274 2140 1385  0  2 93  5
>  0  1      0 4488608  12060 7190772    0    0  97628 100358 2176 1398  0  1 94  4
>  1  0      0 4485880  12052 7192600    0    0 112406 114686 2461 1509  0  3 90  7
>  1  0      0 4483424  12052 7195612    0    0  64678  65536 1449  948  0  1 91  8
>  0  1      0 4480252  12052 7199404    0    0  99608 100356 2217 1452  0  1 96  3
> 

Also note the difference in the free/cache distribution, unless you took these numbers at completely different stages of the bonnie++ run.
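
To make the two runs comparable, it may be worth capturing the memory state on both boxes just before the benchmark, and optionally flushing the page cache first so they start from the same place (sketch only; the drop_caches write needs root):

    free -m                                   # snapshot of free/buffers/cache
    sync; echo 3 > /proc/sys/vm/drop_caches   # flush the page cache before the run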

> Any ideas where to look next would be greatly appreciated.
> 
> Craig
> 
