On Fri, May 13, 2011 at 09:01:19PM +0800, Wantao wrote:
> Hi guys,
>
> I am a PVFS2 newbie and made some performance tests using IOZone, but the
> results puzzle me. I have 16 machines. One is the metadata server, and the other
> 15 machines are both PVFS2 I/O servers and clients. Each client machine runs one
> IOZone process, so the aggregate performance is measured. The machines are
> configured as follows: one Intel i7-860 processor, 16GB DDR3 memory, and a 1TB
> SATA hard disk. They are connected through a gigabit Ethernet switch. The OS
> is Debian Lenny (2.6.26 kernel). PVFS2 is version 2.8.2 with the default
> configuration.
>
> The IOZone command used is: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m
> pvfs_client_list. Since the memory capacity of each machine is 16GB, I set
> the test file size to 32GB to exercise PVFS2 heavily. The results are
> listed below:
Did you have a chance to try thread counts other than 15? Specifically,
I think 4 (or 8 if hyper-threading is enabled) per client would be
interesting, just as a point of reference. Typically higher aggregate
numbers are seen when running with multiple threads per node; a sketch
of how to set that up is below.
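As a rough sketch of what I mean (the host names, mount point, and iozone
path here are just placeholders for your own setup): IOZone starts one
process per line of the -+m file, so repeating a host gives you multiple
streams on that client, and -t has to match the total number of lines
used. Something like:

    # pvfs_client_list: one line per IOZone process; repeat a host to get
    # more than one process on it (4 per client shown)
    node01 /mnt/pvfs2 /usr/local/bin/iozone
    node01 /mnt/pvfs2 /usr/local/bin/iozone
    node01 /mnt/pvfs2 /usr/local/bin/iozone
    node01 /mnt/pvfs2 /usr/local/bin/iozone
    node02 /mnt/pvfs2 /usr/local/bin/iozone
    ...

    # 15 clients x 4 processes = 60 streams; shrink -s so each node still
    # writes more data in aggregate than its 16GB of RAM
    ./iozone -i 0 -i 1 -i 2 -r 4m -s 8g -t 60 -+m pvfs_client_list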
>
> Record Size 4096 KB
> File size set to 33554432 KB
> Network distribution mode enabled.
> Command line used: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m
> pvfs_client_list
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
> Throughput test with 15 processes
> Each process writes a 33554432 Kbyte file in 4096 Kbyte records
>
> Test running:
> Children see throughput for 15 initial writers = 785775.56 KB/sec
> Min throughput per process = 50273.01 KB/sec
> Max throughput per process = 53785.79 KB/sec
> Avg throughput per process = 52385.04 KB/sec
> Min xfer = 31375360.00 KB
>
> Test running:
> Children see throughput for 15 rewriters = 612876.38 KB/sec
> Min throughput per process = 39466.78 KB/sec
> Max throughput per process = 41843.63 KB/sec
> Avg throughput per process = 40858.43 KB/sec
> Min xfer = 31649792.00 KB
>
> Test running:
> Children see throughput for 15 readers = 366397.27 KB/sec
> Min throughput per process = 9371.45 KB/sec
> Max throughput per process = 29229.74 KB/sec
> Avg throughput per process = 24426.48 KB/sec
> Min xfer = 10760192.00 KB
>
> Test running:
> Children see throughput for 15 re-readers = 370985.14 KB/sec
> Min throughput per process = 9850.98 KB/sec
> Max throughput per process = 29660.86 KB/sec
> Avg throughput per process = 24732.34 KB/sec
> Min xfer = 11145216.00 KB
>
> Test running:
> Children see throughput for 15 random readers = 257970.32 KB/sec
> Min throughput per process = 8147.65 KB/sec
> Max throughput per process = 20084.32 KB/sec
> Avg throughput per process = 17198.02 KB/sec
> Min xfer = 13615104.00 KB
>
> Test running:
> Children see throughput for 15 random writers = 376059.73 KB/sec
> Min throughput per process = 24060.38 KB/sec
> Max throughput per process = 26446.96 KB/sec
> Avg throughput per process = 25070.65 KB/sec
> Min xfer = 30527488.00 KB
>
> I have three questions:
> 1. Why does write outperform rewrite so significantly? According to IOZone's
> documentation, rewrite is supposed to perform better, since it writes to a file
> that already exists and the metadata is already there.
I don't have a concrete answer. One thing to try is to run the tests
separately and drop the caches on each server between the runs, along
these lines:
    iozone -i 0 ....
    # drop the page cache on every I/O server (via ssh, pdsh, etc.)
    for server in $IO_SERVERS; do
        ssh $server "echo 3 > /proc/sys/vm/drop_caches"
    done
    iozone -i 4
One other thing to keep in mind is that OrangeFS/PVFS doesn't have a
client-side cache, so a re-write doesn't see the benefit of the data
already being in cache on the client side. That's not really an answer
for why re-write is slower; I would expect the two to be roughly
equivalent within some delta. If the first write gets roughly half of the
file's data into the page cache on each server, it may see a performance
benefit that the re-write test doesn't get, since the second half of the
write test blew that cache out on each server.
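If you want to confirm that, a quick way to see how much of each server's
16GB is sitting in page cache between runs (reusing the placeholder
$IO_SERVERS list from above) is something like:

    for server in $IO_SERVERS; do
        # Cached/Buffers lines show how much file data each server holds
        ssh $server "grep -E 'Cached|Buffers' /proc/meminfo"
    done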
> 2. Why is write/random-write so much faster than read/random-read? This
> result is really unexpected. I feel that read is supposed to be faster. Is
> there anything wrong with my result numbers?
Strip size and stride may be causing some funny access patterns. Try to
get the stride of the reads/writes to line up with the file system
strip. By default the file system strip is 64k; you'll likely want to
have that match your record size. For these tests with a 4m record size,
adding a stanza inside the Filesystem context will set the default strip
to 4M:
    <Distribution>
        Name simple_stripe
        Param strip_size
        Value 4194304
    </Distribution>
That should also make writes aligned with one record per server. For 15
I/O servers, try setting the stride to 15 (via -j). I don't know
what effect it will have or which test(s) are considered the 'stride
read' test, but it would be a good option to check; an example invocation
is below.
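As a sketch (these are standard IOZone options, but I haven't verified
this exact combination on your setup): -i 5 adds the strided-read test and
-j sets the stride in units of the record size:

    # 4m records to match the 4M strip, stride of 15 records for the
    # 15 I/O servers
    ./iozone -i 0 -i 1 -i 2 -i 5 -r 4m -s 32g -j 15 -t 15 -+m pvfs_client_list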
> 3. Observing the max and min throughput per process in each test, you
> can see that for write/re-write/random-write the difference between max and
> min is acceptable, while for read/re-read/random-read the max throughput is
> about two or three times the min number. How can I explain this result?
> Is it normal?
I wouldn't consider it normal; let's see if the changes I've mentioned
reduce the deviation.
>
> These results are not what I expected. Is it possible that they are caused
> by faulty hardware (network or disk) or by misconfiguration?
I don't think it's an issue of faulty hardware. I'm guessing it's a
matter of matching the file system configuration to be optimal for the
tests that are being run.
Michael
>
> Any advice is appreciated.
>
> Sincerely,
> Wantao
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users