On Dec 29, 2006, at 11:44 AM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Fri, 29 Dec 2006 10:43 -0500:
What performance do you typically see with a single client and single
server (not the same machine) with 10 Gb/s NICs?

1 metaserver, 1 I/O server, 1 client, 16 MB flow buffer sizes.  Here
are some similarly uninteresting numbers on IB, with the server running
at maybe around 50%.  (Only 800 MB, else I fall into swap.):

ib30$ pvfs2-cp -t /tmp/tmpfs/800m /pvfs-ib/x1
Wrote 838860800 bytes in 3.035549 seconds. 263.543767 MB/seconds

ib30$ pvfs2-cp -t -b $((64*1024*1024)) /tmp/tmpfs/800m /pvfs-ib/x1
Wrote 838860800 bytes in 2.237504 seconds. 357.541259 MB/seconds

Thanks for the sanity check.

pvfs2-cp isn't that great a code.  Find yourself an MPI interface
benchmark, like "perf".  This produces server load around 90%:

ib30$ mpiexec -n 1 2402/perf -n 10 -s 800m -c 100m -f pvfs2:/pvfs-ib/x1
#np size chunk write no sync- read no sync-- write sync---- read sync-----
#   (MB)    (MB)  (MB/s)         (MB/s)         (MB/s)         (MB/s)
1 800.0 100.0 681.56 +- 1.9 612.87 +- 2.2 679.99 +- 1.0 613.76 +- 3.0

With 1 MB flow buffers, the server is pegged and slower, more like
what you're seeing:

ib30$ mpiexec -n 1 2402/perf -n 10 -s 800m -c 100m -f pvfs2:/pvfs-ib/x1
#np size chunk write no sync- read no sync-- write sync---- read sync-----
#   (MB)    (MB)  (MB/s)         (MB/s)         (MB/s)         (MB/s)
1 800.0 100.0 342.73 +- 3.8 317.24 +- 2.6 343.96 +- 2.2 318.34 +- 1.8

Do these require the kernel module? I have not tried using that yet.

It's important to keep the flow buffer size comparable with the
network speed.  The default 256 kB is too small even for gige.
The stripe size only comes into play with multiple IO servers, and
that wants to be large too.
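For reference, raising the flow buffer size is a server-side config change.  A hedged sketch, assuming FlowBufferSizeBytes sits in the FileSystem section of the pvfs2 server config (section name, casing, and surrounding options may differ between releases):

```
<FileSystem>
    Name pvfs2-fs
    ID 9
    FlowBufferSizeBytes 16777216
</FileSystem>
```

After changing it, restart the servers so the new flow buffer size takes effect.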

I do not see any improvement using larger than 1 MB with FlowBufferSizeBytes.

On the same machine, if I use dd to copy from /dev/zero to
/mnt/tmpfs/zeros using 1 MB blocks, I get 300 MB/s for a 1 GB file.

This is wrong.  You should get 700-900 MB/s for memcpy on a recent
vintage machine.  Data in tmpfs will go to swap if you exceed the
free memory on the box.  Watch for that.

I was not swapping (I have 8 GB available). Using ramfs instead of tmpfs, I can get 1,200 MB/s. I have switched to ramfs but the numbers are roughly the same.

Initially, I used the dumbest BMI_meth_memalloc() and
BMI_meth_memfree() implementations, where they are simply calls to
malloc() and free(), and I was getting about 300 MB/s.  Thinking that
this was the problem, I tinkered with mallopt() to set higher
thresholds for trim and mmap.  This added about 50 MB/s.
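The mallopt() tweak in question might look roughly like this; the helper name and the 1 MB threshold are illustrative choices, not part of any PVFS2 interface:

```c
#include <malloc.h>

/* Raise glibc malloc's trim and mmap thresholds so that large flow
 * buffers are recycled on the heap rather than being unmapped (and
 * returned to the kernel) on every free().  Returns 0 on success.
 * Note: mallopt() itself returns 1 on success, 0 on failure. */
int raise_malloc_thresholds(size_t bytes)
{
    if (mallopt(M_TRIM_THRESHOLD, (int) bytes) == 0)
        return -1;
    if (mallopt(M_MMAP_THRESHOLD, (int) bytes) == 0)
        return -1;
    return 0;
}
```

Called once at startup, e.g. raise_malloc_thresholds(1024 * 1024), this keeps 1 MB allocations off the mmap path so repeated alloc/free cycles reuse the same pages.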

Next, I pre-malloced memory at startup and manage a list of these
buffers.  This added another 50 MB/s, getting me to 400 MB/s.
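For concreteness, the pre-allocated list might look roughly like the following sketch; all names and sizes are illustrative, not the actual BMI method code:

```c
#include <stdlib.h>

/* A fixed stack of buffers malloced once at startup.  pool_get() hands
 * out a cached buffer when one fits; pool_put() returns it.  Requests
 * that don't fit fall back to plain malloc()/free(). */
#define POOL_BUFS  16
#define POOL_BUFSZ (1024 * 1024)   /* match the flow buffer size */

static void *pool[POOL_BUFS];
static int pool_top;               /* number of free buffers stacked */

int pool_init(void)
{
    for (pool_top = 0; pool_top < POOL_BUFS; pool_top++) {
        pool[pool_top] = malloc(POOL_BUFSZ);
        if (!pool[pool_top])
            return -1;
    }
    return 0;
}

void *pool_get(size_t len)
{
    if (len <= POOL_BUFSZ && pool_top > 0)
        return pool[--pool_top];   /* reuse a pre-malloced buffer */
    return malloc(len);            /* oversized or pool exhausted */
}

void pool_put(void *buf, size_t len)
{
    if (len <= POOL_BUFSZ && pool_top < POOL_BUFS)
        pool[pool_top++] = buf;    /* keep it for the next request */
    else
        free(buf);
}
```

The LIFO order means the most recently used (and cache-warm) buffer is handed out first.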

IB uses malloc/free, but caches freed blocks to avoid costly
re-registrations later, handing out an old block on future malloc
calls.  You probably don't care about registration, but we added
a hook so the IO client can tell the BMI device about the
user-supplied buffer rather than seeing lots of 64 kB buffers:
BMI_OPTIMISTIC_BUFFER_REG.
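The cache-on-free behavior described above might be sketched as follows; the names and the one-slot-per-entry layout are ours, not the real BMI IB code:

```c
#include <stdlib.h>

/* free() stashes the block (with its network registration notionally
 * still valid) and a later malloc of the same size hands the old
 * block back, avoiding a costly re-registration. */
#define CACHE_SLOTS 8

static struct { void *ptr; size_t len; } cache[CACHE_SLOTS];

void *cached_malloc(size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr && cache[i].len == len) {
            void *p = cache[i].ptr;     /* hit: reuse old block */
            cache[i].ptr = NULL;
            return p;
        }
    }
    return malloc(len);                 /* miss: fresh block */
}

void cached_free(void *ptr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].ptr) {
            cache[i].ptr = ptr;         /* keep it; skip dereg */
            cache[i].len = len;
            return;
        }
    }
    free(ptr);                          /* cache full: really free */
}
```

A real implementation would also deregister and free cached blocks on shutdown or under memory pressure.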

What does BMI_OPTIMISTIC_BUFFER_REG do?  Internally, MX can cache some registrations (the API does not expose this), so it is in my best interest to try to reuse buffers.

I tried playing with pvfs2-cp's -b option, but performance never
improved over the default behavior.  Interestingly, on the client,
pvfs2-cp only uses two 1 MB buffers (over and over) for the entire
1 GB transfer.  Is this intentional?  Does this mean that only one
buffer is in flight while the other is being filled?  Is there a way
to get pvfs2-cp to use more concurrent messages?

pvfs2-cp is not exactly optimized for performance.  Don't spend too
much time worrying about it.

                -- Pete

I have found that the -b option _really_ likes multiples of 10 MB (the default seems to be 10 MB).

Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
