It appears that the memory registration code works best if the buffer size is an integer multiple of the strip_size. Making sure that is the case has allowed me to run on our EHCA without exceeding its registration limit. However, I am not seeing much speedup from larger strip and/or buffer sizes. One odd thing I noticed in a new debug run is that even with a strip_size of 1M and a 6M buffer (we have 6 servers), data appears to be sent in 256K chunks. I have posted a new log for this run at http://www.scl.ameslab.gov/~brett/pvfs2-1M.log. The sends look like:
[D 10:51:33.141315] test_rq: rq 0x15501580 completed 24 from da8:3336.
[D 10:51:33.141343] BMI_testcontext completing: 91
[D 10:51:33.141415] BMI_post_send_list: addr: 11, count: 1, total_size: 262144, tag: 20
[D 10:51:33.141443]    element 0: offset: 0x14ea1688, size: 262144
[D 10:51:33.141470] BMI_ib_post_send_list: listlen 1 tag 20.
[D 10:51:33.141505] memcache_register: miss [0] 0x14ea1688 len 262144.
[D 10:51:33.141580] BMI_post_recv: addr: 11, offset: 0x15501730, size: 24, tag:

So the first question is: why am I sending in 256K chunks instead of the full 1M strip size? In addition, in the default configuration with a 64K strip size I see:

[D 10:38:29.272429] BMI_post_send_list: addr: 12, count: 3, total_size: 196608,
tag: 21
[D 10:38:29.272465]    element 0: offset: 0x14eb16a8, size: 65536
[D 10:38:29.272491]    element 1: offset: 0x14f116a8, size: 65536
[D 10:38:29.272517]    element 2: offset: 0x14f716a8, size: 65536
[D 10:38:29.272544] BMI_ib_post_send_list: listlen 3 tag 21.
[D 10:38:29.272578] memcache_register: miss [0] 0x14eb16a8 len 65536.
[D 10:38:29.272763] memcache_register: miss [1] 0x14f116a8 len 65536.
[D 10:38:29.272941] memcache_register: miss [2] 0x14f716a8 len 65536.
[D 10:38:29.273136] BMI_post_recv: addr: 12, offset: 0x15024eb0, size: 24, tag:

I assume this indicates some level of parallel sends for the 64K chunks. I don't see that with the larger stripes. Perhaps that loss of parallelization is why I don't see much speedup from larger stripes?

Thanks,
Brett
On Oct 18, 2006, at 12:49 PM, Pete Wyckoff wrote:

[EMAIL PROTECTED] wrote on Wed, 18 Oct 2006 11:04 -0500:
OK, I finally got some debugging output by hardcoding in the gossip...
calls. I have posted a log file at:
http://www.scl.ameslab.gov/~brett/pvfs2.log

The app in this case is using a 1MB IO buffer to write a ~62MB file
once and then read it back in several times. The pvfs2 debug output
is mixed in with the application output, but I think it's still not
too hard to follow.

Thanks, that's very helpful.  Here's a quick summary of what's going
on in the memory caching.

Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
902 times, on 902 different buffer addresses, almost all 64 kB.  For
the remainder of the runtime there are no misses.  All of these
earlier misses generated cached registrations which are then
reused.  Most are reused exactly 10 times, but three are used
hundreds of times, perhaps control buffers used internally by PVFS.
The reason it is 64 kB is that that's probably the stripe size
you're using for transfers.

I think the time we're seeing in the memcache_* functions must be
due to the length of this list of registrations.  That's a lot of
pointer chasing to get down to the on-average 451st element.  One
thing I can do is put in a more reasonable data structure, but it
will still be a time-consuming function.

It appears to me that despite always being passed the same buffer,
the memcache_register function almost always misses for the write.
Note that the output for a run on one of the EHCAs is very similar.
On the EHCA I can write up to about 220MB before it dies with the
"too much memory registered" error.

This also probably explains your EHCA problem.  Those registrations
show up separately on the NIC, and maybe hit a limit there.

The bigger problem is the same one seen by most applications that
use networks that require memory registration:  program semantics do
not require users to register memory but underlying hardware does,
thus something has to patch that gap.  If you reg/dereg around every
transfer, things are very slow.  Hence we go with caching in some
middle layer to fix this up.  The same is true for MPI as well.
(The Netpipe guys had a way to cause lots of damage by sending lots
of little buffers rather than one big one, I recall.)

You probably see the buffer as a single thing, not 902 little 64 kB
chunks.  Somehow we have to communicate this information to the
message passing layer.  Fortunately you are calling PVFS_sys_write
just once with a single big buffer, not lots of times with
individual chunks of the big buffer, so we have the information down
in PVFS land.  But, we have to figure out how to get this
information down to the networking layer.  The way the internal
abstractions are set up, there's no place where the network can find
out what buffer the user actually passed in.  I'm going to look
around and see if I can figure something out.

By the way, various groups keep rediscovering this problem but there
are no real appealing fixes.  When was the last time you saw anybody
use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
didn't quite complete the work needed to fully integrate it.
(Wu's Unifier framework (CCGrid04):
    http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
)

                -- Pete

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
