Luke,

- XFS will probably generate better data rates with larger files. You really need to use the same file size that postgresql uses. Why compare the speed of reading a 16G file to the speed of reading a 1G file? They won't be the same. If need be, write some code that does the test, or modify lmdd to read a sequence of 1G files; see the sketch below. Will this make a difference? You don't know until you do it. Any time you cross a couple of powers of two in computing, you should expect some differences.
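  Something like this untested sketch (the mount point and file count are
  placeholders) would approximate postgresql's 1G segment files:

    # write sixteen 1G files (131072 * 8k = 1G each), then sync
    for i in `seq 0 15`; do
        dd if=/dev/zero of=/dbfast1/seg.$i bs=8k count=131072
    done
    sync
    # read them back in sequence, timing the whole pass
    time for i in `seq 0 15`; do
        dd if=/dbfast1/seg.$i of=/dev/null bs=8k
    done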

- You did umount the file system before reading the 16G file back in, right? Because if you didn't, your read numbers are possibly garbage: when the read began, 8G of the file was already in memory. You'd be very naive to think that the read of the first 8G somehow flushed that cached data out of memory. After all, why would the kernel flush pages from file X when you're in the middle of a sequential read of... file X? I'm not sure how Linux handles this, but Solaris would have found the 8G still in memory. The fix is a umount/mount cycle before every read pass, as sketched below.
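  A minimal sketch of what I mean (assuming /dbfast1 has an /etc/fstab
  entry so a bare mount works):

    umount /dbfast1 && mount /dbfast1
    time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000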

- What was the hardware and disk configuration on which these numbers were generated? For example, if you have a U320 controller, how did the read rate become larger than 320MB/s?

- How did the results change from before? Just posting the new results is misleading, given all the boasting we've had to read about your past results.

- There are two results below for writing to ext2: one at 209 MB/s and one at 113 MB/s. Why are they different?

- What was the CPU usage during these tests? We see postgresql doing 200+ MB/s of IO. You've claimed many times that the machine would be compute bound at lower IO rates, so how much idle time does the CPU still have? Watching vmstat alongside the scan, as sketched below, would tell you.
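  Something as simple as this would do (a sketch; 5-second samples, with
  the database name taken from your psql prompt):

    vmstat 5 > /tmp/vmstat.out &
    psql -d llonergan -c "select count(1) from lineitem"
    kill %1    # or kill $! if run from a script
    # in /tmp/vmstat.out the us/sy/id/wa columns give user, system,
    # idle and iowait percentages during the scan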

- You wrote: "We'll do a 16GB table size to ensure that we aren't reading from the read cache." Do you really believe that?? You have to umount the file system before each test to ensure you're really measuring the disk IO rate. If I'm reading your results correctly, you have three results each for ext2 and xfs, and each run is faster than the one before it. If that's so, then at least one of them is clearly reading from the read cache.

- Gee, it's so nice of you to drop your 120MB/s observation. I guess my reading at 300MB/s wasn't convincing enough. Yeah, I think it was the CPUs too...

- I wouldn't focus on the flat "64% of the data rate" number. It'll probably be different on other systems.

I'm all for testing and more testing, but it seems you still cut a corner by not umounting the file system first. Maybe I'm a little too old school on this, but I wouldn't spend a dime until you've done the measurements correctly. Good luck.
-- Alan



Luke Lonergan wrote:
Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
================= Results ===================

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>".  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[EMAIL PROTECTED] dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance, and perhaps
consequently the read improves as well.  There doesn't seem to be a
difference between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
we aren't reading from the read cache.  I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it. The
alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.
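The load itself is just a straight COPY of the DBT-3 flat file, along
these lines (the path is hypothetical, and this assumes the usual
'|'-delimited lineitem dump with any trailing delimiter stripped):

  psql -d llonergan -c "COPY lineitem FROM '/stage/lineitem.tbl' WITH DELIMITER '|';"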

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.  None of
the augmentations act to improve the executor I/O though, so for these
purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

3159138 pages * 8192 bytes/page = 25,879,658,496 bytes
25879 million bytes, or 25.9GB

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 122822.090 ms
============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                                     ext2        xfs
Write Speed (MB/s)                   114         311
Read Speed (MB/s)                    332         407
Postgres Seq Scan Speed (MB/s)       212         263
Scan % of lmdd Read Speed            63.9%       64.6%
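(The last row is just the seq scan speed divided by the lmdd read speed:
212/332 = 63.9% for ext2, and 263/407 = 64.6% for xfs.)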

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke


