[EMAIL PROTECTED] wrote on Wed, 18 Oct 2006 11:04 -0500:
> Ok I got some debugging output finally by hardcoding in the gossip...  
> calls. I have posted a log file at:
> http://www.scl.ameslab.gov/~brett/pvfs2.log
>
> The app in this case is using a 1MB IO buffer to write a ~62MB file  
> once and then read it back in several times. The pvfs2 debug output  
> is mixed in with the application output, but I think its still not  
> too hard to follow.

Thanks, that's very helpful.  Here's a quick summary of what's going
on in the memory caching.

Your app runs 11 seconds.  For the first 2.5 sec, the cache misses
902 times, on 902 different buffer addresses, almost all 64 kB.  For
the remainder of the runtime there are no misses.  All of these
previous misses generated cached registrations which are then
reused.  Most are reused exactly 10 times, but three are used
hundreds of times, perhaps control buffers used internally by pvfs.
They are 64 kB presumably because that is the stripe size you're
using for transfers.

I think the time we're seeing in the memcache_* functions is due to
the length of this list of registrations.  That's a lot of pointer
chasing to reach the on-average 451st element.  One thing I can do
is put in a more reasonable data structure, but it will still be a
time-consuming function.
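To make the data-structure point concrete, here is a toy sketch (not
the actual PVFS memcache code; struct and function names are made up)
contrasting the linear scan we effectively do now against a lookup
over entries kept sorted by address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cached-registration entry; illustrative only. */
struct reg_entry {
    uintptr_t addr;   /* start of registered region */
    size_t    len;    /* length of registered region */
};

/* Linear scan, as with a simple list: O(n) per lookup, so an
 * on-average-451st element costs ~451 pointer visits. */
struct reg_entry *find_linear(struct reg_entry *e, int n,
                              uintptr_t addr, size_t len)
{
    for (int i = 0; i < n; i++)
        if (addr >= e[i].addr && addr + len <= e[i].addr + e[i].len)
            return &e[i];
    return NULL;
}

static int cmp_addr(const void *a, const void *b)
{
    const struct reg_entry *x = a, *y = b;
    return (x->addr > y->addr) - (x->addr < y->addr);
}

/* With entries kept sorted by address (e.g. via qsort/cmp_addr), a
 * binary search finds the candidate region in O(log n). */
struct reg_entry *find_sorted(struct reg_entry *e, int n,
                              uintptr_t addr, size_t len)
{
    int lo = 0, hi = n - 1;
    struct reg_entry *best = NULL;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (e[mid].addr <= addr) { best = &e[mid]; lo = mid + 1; }
        else hi = mid - 1;
    }
    if (best && addr + len <= best->addr + best->len)
        return best;
    return NULL;
}
```

A balanced tree or interval structure would do the same job while
staying cheap to insert into; the point is just getting the lookup
off the linear walk.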

> It appears to me that despite always being passed the same buffer the  
> memcache_register function almost always misses for the write. note  
> that the output for a run on one of the EHCA's is very similar. On  
> the EHCA I can write up to about 220MB before it dies with the too  
> much memory registered error.

This also probably explains your EHCA problem.  Those registrations
show up separately on the NIC, and maybe hit a limit there.

The bigger problem is the same one seen by most applications on
networks that require memory registration:  program semantics do
not require users to register memory, but the underlying hardware
does, so something has to bridge that gap.  If you reg/dereg around
every transfer, things are very slow.  Hence we cache registrations
in some middle layer to fix this up.  The same is true for MPI.
(The NetPIPE guys had a way to cause lots of damage by sending lots
of little buffers rather than one big one, as I recall.)
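The pin-down-cache idea is simple enough to sketch in a few lines.
This is a toy model, not PVFS or verbs code: hw_register() stands in
for the expensive hardware call (ibv_reg_mr or whatever the NIC
wants), and the cache is just an array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_REGS = 1024 };
static struct { uintptr_t addr; size_t len; } cache[MAX_REGS];
static int ncached;
static int reg_count;   /* how many times we actually pinned memory */

static void hw_register(uintptr_t addr, size_t len)
{
    (void)addr; (void)len;
    reg_count++;
}

/* Naive scheme: register (and deregister) around every transfer;
 * the pinning cost is paid on every call. */
void transfer_nocache(const void *buf, size_t len)
{
    hw_register((uintptr_t)buf, len);
    /* ... post the send/recv, then deregister ... */
}

/* Cached scheme: register on first miss, reuse on later hits,
 * leave the region pinned for next time. */
void transfer_cached(const void *buf, size_t len)
{
    uintptr_t a = (uintptr_t)buf;
    for (int i = 0; i < ncached; i++)
        if (a >= cache[i].addr && a + len <= cache[i].addr + cache[i].len)
            return;                  /* hit: reuse pinned region */
    hw_register(a, len);             /* miss: pin once, remember it */
    cache[ncached].addr = a;
    cache[ncached].len  = len;
    ncached++;
}
```

Ten transfers from the same buffer cost one registration instead of
ten, which is exactly the reuse pattern visible in your log once the
first 2.5 seconds of misses are over.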

You probably see the buffer as a single thing, not 902 little 64 kB
chunks.  Somehow we have to communicate this information to the
message passing layer.  Fortunately you are calling PVFS_sys_write
just once with a single big buffer, not lots of times with
individual chunks of the big buffer, so we have the information down
in PVFS land.  But, we have to figure out how to get this
information down to the networking layer.  The way the internal
abstractions are set up, there's no place where the network can find
out what buffer the user actually passed in.  I'm going to look
around and see if I can figure something out.
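To show what is at stake, here is an illustrative count (toy code,
made-up names) of how many registrations a striped transfer needs
when the network layer can see the whole user buffer up front versus
when it only ever sees one stripe at a time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region { uintptr_t addr; size_t len; };

static int region_covers(struct region r, uintptr_t addr, size_t len)
{
    return addr >= r.addr && addr + len <= r.addr + r.len;
}

/* Count registrations needed to move `total` bytes in `stripe`-sized
 * chunks.  If know_whole is set, the whole buffer is registered once
 * up front; otherwise each uncovered chunk is registered on its own. */
int regs_needed(struct region whole, size_t stripe, size_t total,
                int know_whole)
{
    int regs = 0;
    struct region have = { 0, 0 };
    if (know_whole) { have = whole; regs = 1; }
    for (size_t off = 0; off < total; off += stripe) {
        uintptr_t a = whole.addr + off;
        size_t l = stripe < total - off ? stripe : total - off;
        if (!region_covers(have, a, l)) {
            have.addr = a; have.len = l;   /* register just this chunk */
            regs++;
        }
    }
    return regs;
}
```

For a 62 MB buffer in 64 kB stripes that is 992 registrations versus
one, which is why getting the original buffer extent down to the
network layer matters so much.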

By the way, various groups keep rediscovering this problem, but
there are no really appealing fixes.  When was the last time you saw
anybody
use MPI_Alloc_mem?  :)  We discovered it ourselves in the context of
PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
didn't quite complete the work needed to fully integrate it.
(Wu's Unifier framework (CCGrid04):
    http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
)

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers