On 5/15/2011 4:53 PM, Michael Moore wrote:
On Fri, May 13, 2011 at 09:01:19PM +0800, Wantao wrote:
Hi guys,

I am a PVFS2 newbie and ran some performance tests using IOZone, but the
results puzzle me. I have 16 machines: one is the metadata server, and the
other 15 are both PVFS2 I/O servers and clients. Each client machine runs one
IOZone process, so aggregate performance is measured. The machines are
configured as follows: one Intel i7-860 processor, 16GB DDR3 memory and a 1TB
SATA hard disk. They are connected through a gigabit Ethernet switch. The OS is
Debian Lenny (2.6.26 kernel). PVFS2 is 2.8.2 with the default configuration.

The IOZone command used is: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m
pvfs_client_list. Since each machine has 16GB of memory, I set the test file
size to 32GB to exercise PVFS2 heavily. The results are listed below:
Did you have a chance to try other thread counts than 15? Specifically,
I think 4 (or 8 if hyper-threading enabled) per client would be
interesting just as a point of reference. Typically higher numbers are
seen when running with multiple threads per node.
I have numbers for 2 and 4 threads per client. Both runs show the same three issues I describe below, and all three sets of results follow similar trends.
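For reference, here is roughly how a 4-processes-per-client run could be spelled
out, assuming I have iozone's -+m file format right (one line per process: client
name, working directory on that client, path to the iozone binary there). The
hostnames and paths below are only placeholders, and -s is reduced to 8g so each
node still handles 32GB in total:

     pvfs_client_list_4x (repeat a client once per process it should run):
         node01 /mnt/pvfs2 /usr/local/bin/iozone
         node01 /mnt/pvfs2 /usr/local/bin/iozone
         node01 /mnt/pvfs2 /usr/local/bin/iozone
         node01 /mnt/pvfs2 /usr/local/bin/iozone
         node02 /mnt/pvfs2 /usr/local/bin/iozone
         ... and so on for the other clients ...

     ./iozone -i 0 -i 1 -i 2 -r 4m -s 8g -t 60 -+m pvfs_client_list_4x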
Record Size 4096 KB
     File size set to 33554432 KB
     Network distribution mode enabled.
     Command line used: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m 
pvfs_client_list
     Output is in Kbytes/sec
     Time Resolution = 0.000001 seconds.
     Processor cache size set to 1024 Kbytes.
     Processor cache line size set to 32 bytes.
     File stride size set to 17 * record size.
     Throughput test with 15 processes
     Each process writes a 33554432 Kbyte file in 4096 Kbyte records

     Test running:
     Children see throughput for 15 initial writers     =  785775.56 KB/sec
     Min throughput per process             =   50273.01 KB/sec
     Max throughput per process             =   53785.79 KB/sec
     Avg throughput per process             =   52385.04 KB/sec
     Min xfer                     = 31375360.00 KB

     Test running:
     Children see throughput for 15 rewriters     =  612876.38 KB/sec
     Min throughput per process             =   39466.78 KB/sec
     Max throughput per process             =   41843.63 KB/sec
     Avg throughput per process             =   40858.43 KB/sec
     Min xfer                     = 31649792.00 KB

     Test running:
     Children see throughput for 15 readers         =  366397.27 KB/sec
     Min throughput per process             =    9371.45 KB/sec
     Max throughput per process             =   29229.74 KB/sec
     Avg throughput per process             =   24426.48 KB/sec
     Min xfer                     = 10760192.00 KB

     Test running:
     Children see throughput for 15 re-readers     =  370985.14 KB/sec
     Min throughput per process             =    9850.98 KB/sec
     Max throughput per process             =   29660.86 KB/sec
     Avg throughput per process             =   24732.34 KB/sec
     Min xfer                     = 11145216.00 KB

     Test running:
     Children see throughput for 15 random readers     =  257970.32 KB/sec
     Min throughput per process             =    8147.65 KB/sec
     Max throughput per process             =   20084.32 KB/sec
     Avg throughput per process             =   17198.02 KB/sec
     Min xfer                     = 13615104.00 KB

     Test running:
     Children see throughput for 15 random writers     =  376059.73 KB/sec
     Min throughput per process             =   24060.38 KB/sec
     Max throughput per process             =   26446.96 KB/sec
     Avg throughput per process             =   25070.65 KB/sec
     Min xfer                     = 30527488.00 KB

I have three questions:
  1. Why does write outperform rewrite so significantly? According to IOZone's
documentation, rewrite is supposed to perform better, since it writes to a file
that already exists and whose metadata is already in place.
I don't have a concrete answer. One thing to try is to run the tests
separately and dump the caches from the server perspective between the
runs, like:
iozone -i 0 ....
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 4
I will try this soon and let you know what happens. One question: why "iozone -i 4"? Should I run the tests separately, like this:

iozone -i 0 ....
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 4
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 1 ...
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 2 ...
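For what it's worth, a minimal sketch of that sequence as a script, assuming
passwordless root ssh to each I/O server (the SERVERS list is a placeholder and
the elided iozone flags stay as above; note that iozone generally wants -i 0
included, or -w on the first run to keep the files, so the later read tests
have something to read):

    #!/bin/sh
    # Run the iozone phases one at a time and drop the page cache on every
    # I/O server in between, so no phase reads data cached by the previous one.
    SERVERS="io01 io02 io03"   # placeholder: list all 15 I/O servers here

    drop_caches() {
        for server in $SERVERS; do
            # sync first so dirty pages are flushed, then drop the clean caches
            ssh root@"$server" "sync; echo 3 > /proc/sys/vm/drop_caches"
        done
    }

    ./iozone -i 0 ....
    drop_caches
    ./iozone -i 4
    drop_caches
    ./iozone -i 1 ...
    drop_caches
    ./iozone -i 2 ...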


One other thing to keep in mind is that OrangeFS/PVFS doesn't have a
client-side cache, so a re-write doesn't see the benefit of already being in
cache on the client side. That's not really an answer for why re-write is
slower, but I would expect the two to be roughly equivalent, within some delta.
Yeah, that's my expectation.

If the first write gets roughly half of the file's data into the page cache
on each server, it may see a performance benefit that the re-write test
doesn't get, since the second half of the write test blew out the cache
on each server.
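(Rough arithmetic, assuming the data ends up striped evenly: 15 clients x 32GB
= 480GB in total, so about 32GB lands on each of the 15 servers, against only
16GB of RAM per server; at most the most recently written half of a server's
data can still be in its page cache when the next pass starts.)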
What does this mean? I don't quite follow your point. Could you please explain a bit more?
  2. Why is write/random-write so much faster than read/random-read? This
result is really unexpected; I would think read should be faster. Is there
anything wrong with my numbers?
Strip size and stride may be causing some funny access patterns. Try to
get the stride of the reads/writes to match up to the file system
stride. By default the file system strip is 64k; you'll likely want
that to match your record size. For these tests with a 4m record size,
adding a stanza inside the FileSystem context will set the default strip
to 4M:

         <Distribution>
                 Name simple_stripe
                 Param strip_size
                 Value 4194304
         </Distribution>

That should also make writes aligned with one record per server.
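For context, a sketch of where that stanza might sit in the server config (the
FileSystem block and its Name are just whatever pvfs2-genconfig produced; only
the Distribution stanza is added):

         <FileSystem>
                 Name pvfs2-fs
                 # ... existing ID, RootHandle and handle-range stanzas unchanged ...
                 <Distribution>
                         Name simple_stripe
                         Param strip_size
                         Value 4194304
                 </Distribution>
         </FileSystem>

Keep in mind that a file's distribution is fixed when the file is created, so
the servers have to pick up the new config (typically a restart) and the test
files have to be recreated before the 4M strip applies.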
I will try this later; I expect it to help. For now I have results for a 64k record size with PVFS's default 64k strip size, from a smaller-scale test (4 machines in total). Those results are pretty regular: write and re-write performance are close to each other, and read is much faster than write.
For 15 I/O servers, try setting the stride to 15 (via -j). I don't know
what effect it will have or which test(s) are considered the 'stride
read test', but it would be a good option to check.
I did not run IOZone's "strided read" test. Will the "-j" option affect the plain "read" test?
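If I remember iozone's test numbering correctly, -j only affects the dedicated
strided-read test, which is selected with -i 5; the plain -i 1 read test
ignores it. Something along these lines would exercise it (the other flags
copied from the run above, with -i 0 included so the files exist):

     ./iozone -i 0 -i 5 -r 4m -s 32g -j 15 -t 15 -+m pvfs_client_list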
  3. Looking at the max and min throughput per process in each test, the
difference between max and min is acceptable for write/re-write/random-write,
while for read/re-read/random-read the max throughput is about two or three
times the min. How can I explain this result? Is it normal?
I wouldn't consider it normal; let's see if the changes I've mentioned
reduce the deviation.

These results are not what I expected. Is it possible that they are caused by
faulty hardware (network or disk) or by the configuration?
I don't think it's an issue of faulty hardware. I'm guessing it's a
matter of tuning the file system configuration to be optimal for the
tests that are being run.

Michael

Any advice is appreciated.

Sincerely,
Wantao
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users



