On Tue, May 17, 2011 at 08:28:02AM +0800, Wantao Liu wrote:
> On 5/15/2011 4:53 PM, Michael Moore wrote:
> > On Fri, May 13, 2011 at 09:01:19PM +0800, Wantao wrote:
> >> Hi guys,
> >>
> >> I am a PVFS2 newbie and ran some performance tests using IOZone, but the 
> >> results puzzle me. I have 16 machines. One is the metadata server, and the 
> >> other 15 machines are both PVFS2 I/O servers and clients. Each client 
> >> machine runs one IOZone process, so the aggregate performance is measured. 
> >> The machines are configured as follows: one Intel i7-860 processor, 16GB 
> >> DDR3 memory and a 1TB SATA hard disk. They are connected through a gigabit 
> >> Ethernet switch. The OS is Debian Lenny (2.6.26 kernel). PVFS2 is version 
> >> 2.8.2 with the default configuration.
> >>
> >> The IOZone command used is: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m 
> >> pvfs_client_list. Since the memory capacity of each machine is 16GB, I set 
> >> the test file size to 32GB to exercise PVFS2 heavily. The results are 
> >> listed below:
> > Did you have a chance to try thread counts other than 15? Specifically,
> > I think 4 (or 8 if hyper-threading is enabled) per client would be
> > interesting just as a point of reference. Typically higher numbers are
> > seen when running with multiple threads per node.
> I have numbers for 2 and 4 threads per client. Both of them also show 
> the three issues I mentioned, and all three sets of results have similar 
> trends.
> >> Record Size 4096 KB
> >>      File size set to 33554432 KB
> >>      Network distribution mode enabled.
> >>      Command line used: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m 
> >> pvfs_client_list
> >>      Output is in Kbytes/sec
> >>      Time Resolution = 0.000001 seconds.
> >>      Processor cache size set to 1024 Kbytes.
> >>      Processor cache line size set to 32 bytes.
> >>      File stride size set to 17 * record size.
> >>      Throughput test with 15 processes
> >>      Each process writes a 33554432 Kbyte file in 4096 Kbyte records
> >>
> >>      Test running:
> >>      Children see throughput for 15 initial writers     =  785775.56 KB/sec
> >>      Min throughput per process             =   50273.01 KB/sec
> >>      Max throughput per process             =   53785.79 KB/sec
> >>      Avg throughput per process             =   52385.04 KB/sec
> >>      Min xfer                     = 31375360.00 KB
> >>
> >>      Test running:
> >>      Children see throughput for 15 rewriters     =  612876.38 KB/sec
> >>      Min throughput per process             =   39466.78 KB/sec
> >>      Max throughput per process             =   41843.63 KB/sec
> >>      Avg throughput per process             =   40858.43 KB/sec
> >>      Min xfer                     = 31649792.00 KB
> >>
> >>      Test running:
> >>      Children see throughput for 15 readers         =  366397.27 KB/sec
> >>      Min throughput per process             =    9371.45 KB/sec
> >>      Max throughput per process             =   29229.74 KB/sec
> >>      Avg throughput per process             =   24426.48 KB/sec
> >>      Min xfer                     = 10760192.00 KB
> >>
> >>      Test running:
> >>      Children see throughput for 15 re-readers     =  370985.14 KB/sec
> >>      Min throughput per process             =    9850.98 KB/sec
> >>      Max throughput per process             =   29660.86 KB/sec
> >>      Avg throughput per process             =   24732.34 KB/sec
> >>      Min xfer                     = 11145216.00 KB
> >>
> >>      Test running:
> >>      Children see throughput for 15 random readers     =  257970.32 KB/sec
> >>      Min throughput per process             =    8147.65 KB/sec
> >>      Max throughput per process             =   20084.32 KB/sec
> >>      Avg throughput per process             =   17198.02 KB/sec
> >>      Min xfer                     = 13615104.00 KB
> >>
> >>      Test running:
> >>      Children see throughput for 15 random writers     =  376059.73 KB/sec
> >>      Min throughput per process             =   24060.38 KB/sec
> >>      Max throughput per process             =   26446.96 KB/sec
> >>      Avg throughput per process             =   25070.65 KB/sec
> >>      Min xfer                     = 30527488.00 KB
> >>
> >> I have three questions:
> >>   1. Why does write outperform rewrite so significantly? According to  
> >> IOZone's documentation, rewrite is supposed to perform better, since it writes 
> >> to a file that already exists and the metadata is already there.
> > I don't have a concrete answer. One thing to try is to run the tests
> > separately and drop the caches on the servers between the runs, like:
> > iozone -i 0 ....
> > #for server in X
> > ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
> > #done
> > iozone -i 4
> I will try this soon and let you know what happens. One question: why 
> "iozone -i 4"? Should I run those tests separately like this:

The intent with iozone -i 4 was to re-run the re-write tests after the 
cache was cleared, to see if you got better numbers. See below for why I 
thought it might make a difference. However, it sounds like with the right 
strip size it is behaving more as expected.
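
To spell it out, the sequence I had in mind looks roughly like the sketch
below. It's only a sketch: it assumes a plain-text list of the I/O servers
in a file called "servers" and root ssh access to each of them (I don't
know how you normally reach the servers), and the iozone options are just
copied from your original command. The -w flag should keep iozone from
unlinking its temporary files so the second pass can reuse them.

# first pass: initial write/re-write, which also creates the test files
./iozone -w -i 0 -r 4m -s 32g -t 15 -+m pvfs_client_list

# drop the page cache on every I/O server before the next pass
for server in $(cat servers); do
    ssh root@$server "sync && echo 3 > /proc/sys/vm/drop_caches"
done

# second pass: record re-write against cold server caches
./iozone -w -i 4 -r 4m -s 32g -t 15 -+m pvfs_client_list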

> 
> iozone -i 0 ....
> #for server in X
> ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
> #done
> iozone -i 4
> #for server in X
> ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
> #done
> iozone -i 1 ...
> #for server in X
> ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
> #done
> iozone -i 2 ...
> 
> 
> > One other thing to keep in mind is that OrangeFS/PVFS doesn't have a
> > client-side cache, so a re-write doesn't see the benefit of being in
> > cache on the client side. Not really an answer for why re-write is
> > slower, but I would expect it to be roughly equivalent within some delta.
> Yeah, that's my expectation.
> >
> > If the first write gets roughly half the file's data into cache on
> > each server, it may see a performance benefit that the re-write test
> > doesn't get, since the second half of the write test blew out the cache
> > on each server.
> What does this mean? I can't quite follow your point. Could you please 
> explain it a bit more?

Sorry for the confusion; to clarify: each server has 16GB of RAM. When 
you run the iozone tests, 32GB of data will go to each server. The first 
16GB of data written to each server will go into the server's VFS page 
cache. From there on, the cache will be full and data will need to be 
flushed to disk, so the second 16GB of data arriving at each server will 
see some penalty as data is flushed from the cache. The re-write test 
will have all 32GB of its data arrive at an already-full cache and incur 
that penalty for the whole run. However, if you run the tests back to 
back you may not see that behavior. It's just an idea as to why the 
re-write was worse; it may not be causing any issues.
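
If you want to check whether that is actually what is happening, watching
the page cache and dirty data on one of the I/O servers while the write
test runs should show the cache filling up around the 16GB mark and
write-back kicking in after that. A quick sketch (the 5 second interval
and which server you watch are arbitrary choices on my part):

# run on one of the I/O servers while the iozone write test is going
while true; do
    date
    grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo
    sleep 5
done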

> >>   2. Why is write/random-write so much faster than read/random-read? This  
> >> result is really unexpected; I would expect read to be faster.  
> >> Is there anything wrong with my numbers?
> > Strip size and stride may be causing some funny access patterns. Try to
> > get the stride of the reads/writes to match up with the file system
> > strip. By default the file system strip is 64k; you'll likely want
> > that to match your record size. For these tests with a 4m record size,
> > adding a stanza inside the Filesystem context will set the default strip
> > to 4M:
> >
> >          <Distribution>
> >                  Name simple_stripe
> >                  Param strip_size
> >                  Value 4194304
> >          </Distribution>
> >
> > That should also make writes aligned with one record per server.
> I will try this later; I expect it will take effect. I already have 
> results for a 64k record size with PVFS's 64k strip size, from a test 
> performed at a smaller scale (4 machines in total). Those results are 
> quite regular: write and re-write performance are close to each other, 
> and read is much faster than write.

Glad to hear it, let me know how the full runs go.
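
For the full runs, just to be explicit about where that stanza lives: it
goes inside the <FileSystem> context of the server config file, something
like the fragment below. The surrounding lines are only placeholders from
my memory of the config layout, so check them against your own fs.conf; I
believe the servers need to be restarted for the change to take effect,
and only newly created files pick up the new strip size.

<FileSystem>
    # ... keep your existing Name, ID, handle ranges, etc. as they are ...
    <Distribution>
        Name simple_stripe
        Param strip_size
        Value 4194304
    </Distribution>
</FileSystem>

If I remember right, there is also a pvfs2-viewdist utility that will print
the distribution parameters of a given file, which is a quick way to confirm
that newly created files are actually getting the 4M strip.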

> >   For 15
> > I/O servers try setting the stride to 15 (via -j). I don't know
> > what effect it will have or which test(s) are considered the 'stride
> > read test' but it would be a good option to check.
> I did not run IOZone's "strided read" test. Will the "-j" option have 
> any effect on the "read" test?

I am not sure which tests the "-j" option is used for. It may just be the 
pread/pwrite tests. If you get a chance, it might be good to see if it 
helps the throughput.
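
If you do give it a try, something like the line below is what I would
start with. I'm going from memory on the iozone options here (-i 5 being
the stride read test and -j being the stride in units of the record size),
so check them against iozone's help output first; -i 0 is included only
because iozone needs a file to read.

# stride read with a stride of 15 records, i.e. one record per I/O server
./iozone -i 0 -i 5 -j 15 -r 4m -s 32g -t 15 -+m pvfs_client_list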

Let me know how the tests go!

Michael

> >>   3. Looking at the max and min throughput per process in each test,  
> >> you can see that for write/re-write/random-write the difference between  
> >> max and min is acceptable, while for read/re-read/random-read the max  
> >> throughput is about two to three times the min. How can I explain this 
> >> result? Is it normal?
> > I wouldn't consider it normal; let's see if the changes I've mentioned
> > reduce the deviation.
> >
> >> These results are contrary to my expectations. Is it possible that they 
> >> are caused by faulty hardware (network or disk) or by the configuration?
> > I don't think it's an issue of faulty hardware. I'm guessing it's a
> > matter of matching the file system configuration to the tests that are
> > being run.
> >
> > Michael
> >
> >> Any advice is appreciated.
> >>
> >> Sincerely,
> >> Wantao
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
