OK, I will send you the new results when I get them.

Wantao

On 5/17/2011 3:41 PM, Michael Moore wrote:
On Tue, May 17, 2011 at 08:28:02AM +0800, Wantao Liu wrote:
On 5/15/2011 4:53 PM, Michael Moore wrote:
On Fri, May 13, 2011 at 09:01:19PM +0800, Wantao wrote:
Hi guys,

I am a PVFS2 newbie and ran some performance tests using IOZone, but the 
results puzzle me. I have 16 machines: one is the metadata server, and the other 15 
machines are both PVFS2 I/O servers and clients.  Each client machine runs one 
IOZone process, so aggregate performance is measured. The machines are 
configured as follows: one Intel i7-860 processor, 16GB DDR3 memory and a 1TB 
SATA hard disk. They are connected through a gigabit Ethernet switch. The OS is 
Debian Lenny (2.6.26 kernel). PVFS2 is version 2.8.2 with the default configuration.

The IOZone command used is: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m 
pvfs_client_list. Since each machine has 16GB of memory, I set the test file 
size to 32GB to exercise PVFS2 heavily. The results are listed below:
Did you have a chance to try thread counts other than 15? Specifically,
I think 4 (or 8 if hyper-threading is enabled) per client would be
interesting just as a point of reference. Typically higher numbers are
seen when running with multiple threads per node.
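
One way to set that up with -+m (an untested sketch; the hostnames, paths, and
the assumption that extra processes come from repeated lines in the client
file are mine) is to list each client once per desired process and raise -t to
match, scaling -s down so the total amount of data stays comparable:

    # pvfs_client_list: one line per iozone process, so four lines per host for
    # 4 processes per client; fields are hostname, working directory, path to iozone
    client01 /mnt/pvfs2 /usr/local/bin/iozone
    client01 /mnt/pvfs2 /usr/local/bin/iozone
    client01 /mnt/pvfs2 /usr/local/bin/iozone
    client01 /mnt/pvfs2 /usr/local/bin/iozone
    client02 /mnt/pvfs2 /usr/local/bin/iozone
    ...

    # 15 clients x 4 processes = 60 processes; -s 8g keeps the total near 480GB
    ./iozone -i 0 -i 1 -i 2 -r 4m -s 8g -t 60 -+m pvfs_client_list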
I have numbers for 2 and 4 threads per client. Both of them also show the
three issues I mentioned, and all three sets of results have similar trends.
Record Size 4096 KB
      File size set to 33554432 KB
      Network distribution mode enabled.
      Command line used: ./iozone -i 0 -i 1 -i 2 -r 4m -s 32g -t 15 -+m 
pvfs_client_list
      Output is in Kbytes/sec
      Time Resolution = 0.000001 seconds.
      Processor cache size set to 1024 Kbytes.
      Processor cache line size set to 32 bytes.
      File stride size set to 17 * record size.
      Throughput test with 15 processes
      Each process writes a 33554432 Kbyte file in 4096 Kbyte records

      Test running:
      Children see throughput for 15 initial writers     =  785775.56 KB/sec
      Min throughput per process             =   50273.01 KB/sec
      Max throughput per process             =   53785.79 KB/sec
      Avg throughput per process             =   52385.04 KB/sec
      Min xfer                     = 31375360.00 KB

      Test running:
      Children see throughput for 15 rewriters     =  612876.38 KB/sec
      Min throughput per process             =   39466.78 KB/sec
      Max throughput per process             =   41843.63 KB/sec
      Avg throughput per process             =   40858.43 KB/sec
      Min xfer                     = 31649792.00 KB

      Test running:
      Children see throughput for 15 readers         =  366397.27 KB/sec
      Min throughput per process             =    9371.45 KB/sec
      Max throughput per process             =   29229.74 KB/sec
      Avg throughput per process             =   24426.48 KB/sec
      Min xfer                     = 10760192.00 KB

      Test running:
      Children see throughput for 15 re-readers     =  370985.14 KB/sec
      Min throughput per process             =    9850.98 KB/sec
      Max throughput per process             =   29660.86 KB/sec
      Avg throughput per process             =   24732.34 KB/sec
      Min xfer                     = 11145216.00 KB

      Test running:
      Children see throughput for 15 random readers     =  257970.32 KB/sec
      Min throughput per process             =    8147.65 KB/sec
      Max throughput per process             =   20084.32 KB/sec
      Avg throughput per process             =   17198.02 KB/sec
      Min xfer                     = 13615104.00 KB

      Test running:
      Children see throughput for 15 random writers     =  376059.73 KB/sec
      Min throughput per process             =   24060.38 KB/sec
      Max throughput per process             =   26446.96 KB/sec
      Avg throughput per process             =   25070.65 KB/sec
      Min xfer                     = 30527488.00 KB

I have three questions:
   1. Why does write outperform rewrite so significantly? According to IOZone's
documentation, rewrite is supposed to perform better, since it writes to a file
that already exists and the metadata is already in place.
I don't have a concrete answer. One thing to try is to run the tests
separately and dump the caches from the server perspective between the
runs, like:
iozone -i 0 ....
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 4
I will try this soon and let you know what happens. One question: why
"iozone -i 4"? Should I run those tests separately like this:
The intent with iozone -i 4 was to re-run the re-write tests after the
cache was cleared to see if you got better numbers. See below for why I
thought it may make a difference. However, it sounds like with the right
stripe size it is behaving more as expected.

iozone -i 0 ....
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 4
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 1 ...
#for server in X
ssh/psh server "echo 3 > /proc/sys/vm/drop_caches"
#done
iozone -i 2 ...
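
Roughly the same thing as a small script (an untested sketch; the server names
and paths are placeholders, and iozone's -w flag may be needed so the files
written by -i 0 are kept around for the later read-only runs):

    #!/bin/sh
    # Untested sketch of the sequence above.
    SERVERS="io01 io02 io03"        # ...all 15 I/O servers
    IOZONE="./iozone -r 4m -s 32g -t 15 -w -+m pvfs_client_list"

    drop_caches() {
        for s in $SERVERS; do
            # needs root on the servers; sync first so dirty data reaches disk
            ssh "root@$s" 'sync; echo 3 > /proc/sys/vm/drop_caches'
        done
    }

    $IOZONE -i 0      # write + re-write
    drop_caches
    $IOZONE -i 4      # -i 4: re-write (record rewrite) test, as suggested above
    drop_caches
    $IOZONE -i 1      # read + re-read
    drop_caches
    $IOZONE -i 2      # random read + random write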


One other thing to keep in mind is that OrangeFS/PVFS doesn't have a
client-side cache, so a re-write doesn't see the benefit of the data being in
cache on the client side. That's not really an answer for why re-write is
slower, but I would expect the two to be roughly equivalent within some delta.
Yeah, that's my expectation.
If the first write gets roughly half of the file's data into cache on each
server, it may see a performance benefit that the re-write test doesn't get,
since the second half of the write test blew out the cache on each server.
What does this mean? I don't quite get your point. Could you please explain
it a bit more?
Sorry for the confusion; to clarify: each server has 16GB of RAM, and when
you run the iozone tests, 32GB of data will go to each server. The first 16GB
of data written to each server goes into the server's VFS cache. From then on
the cache is full and data has to be flushed to disk, so the second 16GB
arriving at each server pays some penalty while the cache is being flushed.
The re-write test, on the other hand, pushes its entire 32GB through an
already-full cache and pays that penalty for the whole run. However, if you
run the tests back to back you may not see that behavior. It's just an idea
about why the re-write was worse; it may not be causing any issues.
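
As a back-of-the-envelope check using the numbers from this thread, and one
way to watch the server-side cache filling up while a test runs:

    # 15 clients x 32GB per file = 480GB in total, striped over 15 I/O servers,
    # so roughly 32GB lands on each server versus 16GB of RAM per server; about
    # the second half of the initial write therefore has to be flushed to disk.
    # Watching Cached/Dirty/Writeback on a server during a run:
    watch -n 5 'grep -E "^(Cached|Dirty|Writeback):" /proc/meminfo'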

   2. Why is write/random-write so much faster than read/random-read? This
result is really unexpected; I would assume read should be faster. Is there
anything wrong with my result numbers?
Strip size and stride may be causing some funny access patterns. Try to
get the stride of the reads/writes to match up to the file system
stride. By default the file system strip is 64k; you'll likely want to
have that match your record size. For these tests with a 4m record size,
adding a stanza inside the FileSystem context will set the default strip
to 4M:

          <Distribution>
                  Name simple_stripe
                  Param strip_size
                  Value 4194304
          </Distribution>

That should also make writes aligned with one record per server.
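
For reference, in the server's fs.conf that stanza sits inside the existing
FileSystem section, roughly like this (the Name/ID/RootHandle values below are
just placeholders for whatever your config already has, and I believe the new
strip size only takes effect for files created after the servers are
restarted):

          <FileSystem>
                  Name pvfs2-fs            # placeholder; keep your existing values
                  ID 9
                  RootHandle 1048576
                  <Distribution>
                          Name simple_stripe
                          Param strip_size
                          Value 4194304    # 4MB strip to match the 4m iozone record size
                  </Distribution>
                  # ...the rest of the existing FileSystem section stays unchanged
          </FileSystem>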
I will try this later; I expect it will take effect. Right now I have
results for a 64k record size with PVFS's default 64k strip size. That test
was performed at a smaller scale (4 machines in total), and the results look
pretty regular: write and re-write performance are close to each other, and
read is much faster than write.
Glad to hear it; let me know how the full runs go.

   For 15 I/O servers, try setting the stride to 15 (via -j). I don't know
what effect it will have or which test(s) are considered the 'strided
read test', but it would be a good option to check.
I did not run the "strided read" test of IOZone. Will the "-j" option
affect the "read" test?
I am not sure which tests the "-j" option applies to; it may just be the
pread/pwrite tests. If you get a chance, it would be good to see whether it
helps the throughput.
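
For example, just adding -j to the earlier command line (and, if I remember
the test numbering correctly, -i 5 runs the strided read test explicitly):

    # same run as before, with a stride of 15 records; -i 5 is the strided read test
    ./iozone -i 0 -i 1 -i 2 -i 5 -r 4m -s 32g -t 15 -j 15 -+m pvfs_client_list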

Let me know how the tests go!

Michael

   3. Looking at the max and min throughput per process in each test, you
can see that for write/re-write/random-write the difference between max and
min is acceptable, while for read/re-read/random-read the max throughput is
about two to three times the min. How can I explain this result? Is it
normal?
I wouldn't consider it normal; let's see if the changes I've mentioned
reduce the deviation.

These results are not what I expected. Is it possible that they are caused by
faulty hardware (network or disk) or by a configuration problem?
I don't think it's an issue of faulty hardware. I'm guessing it's a
matter of matching the file system configuration to be optimal for the
tests that are being run.

Michael

Any advice is appreciated.

Sincerely,
Wantao
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users


