Luke,

- XFS will probably generate better data rates with larger files. You really need to use the same file size that postgresql uses. Why compare the speed of reading a 16G file to the speed of reading a 1G file? They won't be the same. If need be, write some code that does the test, or modify lmdd to read a sequence of 1G files; see the sketch below. Will this make a difference? You don't know until you do it. Any time you cross a couple of powers of two in computing, you should expect some differences.
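  Something like this untested sketch (the mount point and file count are
  placeholders) would approximate postgresql's 1G segment files:

    # write sixteen 1G files (131072 * 8k = 1G each), then sync
    for i in `seq 0 15`; do
        dd if=/dev/zero of=/dbfast1/seg.$i bs=8k count=131072
    done
    sync
    # read them back in sequence, timing the whole pass
    time for i in `seq 0 15`; do
        dd if=/dbfast1/seg.$i of=/dev/null bs=8k
    done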

- You did umount the file system before reading the 16G file back in, right? Because if you didn't, your read numbers are possibly garbage: when the read began, 8G of the file was already in memory. You'd be very naive to think that the read of the first 8G somehow flushed that cached data out of memory. After all, why would the kernel flush pages from file X when you're in the middle of a sequential read of... file X? I'm not sure how Linux handles this, but Solaris would have found the 8G still in memory. The fix is a umount/mount cycle before every read pass, as sketched below.
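  A minimal sketch of what I mean (assuming /dbfast1 has an /etc/fstab
  entry so a bare mount works):

    umount /dbfast1 && mount /dbfast1
    time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000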

- What was the hardware and disk configuration on which these numbers were generated? For example, if you have a U320 controller, how did the read rate become larger than 320MB/s?

- How did the results change from before? Just posting the new results is misleading, given all the boasting we've had to read about your past results.

- There are two results below for writing to ext2: one at 209 MB/s and one at 113 MB/s. Why are they different?

- What was the CPU usage during these tests? We see postgresql doing 200+ MB/s of IO. You've claimed many times that the machine would be compute bound at lower IO rates, so how much idle time does the CPU still have? Watching vmstat alongside the scan, as sketched below, would tell you.
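  Something as simple as this would do (a sketch; 5-second samples, with
  the database name taken from your psql prompt):

    vmstat 5 > /tmp/vmstat.out &
    psql -d llonergan -c "select count(1) from lineitem"
    kill %1    # or kill $! if run from a script
    # in /tmp/vmstat.out the us/sy/id/wa columns give user, system,
    # idle and iowait percentages during the scan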

- You wrote: "We'll do a 16GB table size to ensure that we aren't reading from the read cache." Do you really believe that?? You have to umount the file system before each test to ensure you're really measuring the disk IO rate. If I'm reading your results correctly, you have three results each for ext2 and xfs, and each run is faster than the one before it. If that's so, then at least one of them is clearly reading from the read cache.

- Gee, it's so nice of you to drop your 120MB/s observation. I guess my reading at 300MB/s wasn't convincing enough. Yeah, I think it was the CPUs too...

- I wouldn't focus on the flat "64% of the data rate" number. It'll probably be different on other systems.

I'm all for testing and more testing, but it seems you still cut a corner by not umounting the file system first. Maybe I'm a little too old school on this, but I wouldn't spend a dime until you've done the measurements correctly. Good luck.
-- Alan



Luke Lonergan wrote:
Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
================= Results ===================

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>".  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[EMAIL PROTECTED] dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance, and perhaps
consequently the read improves as well.  There doesn't seem to be a
difference between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
we aren't reading from the read cache.  I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it. The
alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.
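The load itself is just a straight COPY of the DBT-3 flat file, along
these lines (the path is hypothetical, and this assumes the usual
'|'-delimited lineitem dump with any trailing delimiter stripped):

  psql -d llonergan -c "COPY lineitem FROM '/stage/lineitem.tbl' WITH DELIMITER '|';"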

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

3159138 pages * 8192 bytes/page = 25,879,658,496 bytes
25879 million bytes, or 25.9GB

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 122822.090 ms
============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                                     ext2        xfs
Write Speed (MB/s)                   114         311
Read Speed (MB/s)                    332         407
Postgres Seq Scan Speed (MB/s)       212         263
Scan % of lmdd Read Speed            63.9%       64.6%
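(The last row is just the seq scan speed divided by the lmdd read speed:
212/332 = 63.9% for ext2, and 263/407 = 64.6% for xfs.)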

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke


