The bigger problem is the same one seen by most applications that
use networks that require memory registration: program semantics do
not require users to register memory but underlying hardware does,
thus something has to patch that gap. If you register and deregister
around every transfer, things are very slow. Hence we go with caching
in some middle layer to fix this up. The same is true for MPI as well.
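
To make the caching idea concrete, here is a minimal sketch of such a
middle layer. It is a toy model only: the class name, the LRU policy and
the counters are assumptions for illustration, and _register() merely
stands in for the expensive pin-and-register call into the driver. No
real verbs or PVFS code is involved.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy LRU cache of memory registrations, keyed by (address, length)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (addr, length) -> fake handle
        self.registrations = 0        # number of expensive register calls
        self.deregistrations = 0      # number of evictions

    def _register(self, addr, length):
        # Stand-in for the slow pin-and-register call (hypothetical).
        self.registrations += 1
        return ("handle", addr, length)

    def _deregister(self, handle):
        # Stand-in for the matching unpin-and-deregister call.
        self.deregistrations += 1

    def lookup(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            # Hit: reuse the existing registration and mark it recent.
            self.entries.move_to_end(key)
            return self.entries[key]
        if len(self.entries) >= self.capacity:
            # Cache full: evict the least recently used registration.
            _, victim = self.entries.popitem(last=False)
            self._deregister(victim)
        handle = self._register(addr, length)
        self.entries[key] = handle
        return handle


def send_many(cache, addr, length, count):
    """Simulate `count` transfers from the same buffer."""
    for _ in range(count):
        cache.lookup(addr, length)
    return cache.registrations


cache = RegistrationCache(capacity=4)
print(send_many(cache, 0x10000, 128 * 1024, 1000))
```

With a buffer that is reused, the thousand transfers cost a single
registration, which is the case the cache is built for.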
(The NetPIPE guys had a way to cause lots of damage by sending lots
of little buffers rather than one big one, I recall.)
The NetPIPE guy(s), which is me right now, do this with a ping-pong
of, say, a 128k message, but we send the message from a different
address each time. This beats up memory registration caches nicely ;)
We call this NetPIPE's cache-invalidate mode. It was originally
written to address stuff that ended up in CPU caches, but it works
quite nicely to break other caches as well.
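
Why a moving send address hurts so much can be seen in a rough
simulation. This is not NetPIPE code: the function name, the cache
capacity of 64 and the LRU policy are all assumptions, and the third
access pattern (cycling through a pool one entry larger than the cache)
is added only to show that the buffers do not even need to be unique.

```python
from collections import OrderedDict


def count_registrations(addresses, capacity):
    """Return how many sends miss a toy LRU registration cache."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)    # hit: registration is reused
            continue
        misses += 1                    # miss: a new registration is needed
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict least recently used entry
        cache[addr] = True
    return misses


MSG = 128 * 1024   # message size in bytes (assumed)
ITERATIONS = 1000
BASE = 0x100000    # arbitrary starting address

# Normal ping-pong: the same buffer every time.
same_buffer = [BASE] * ITERATIONS
# Cache-invalidate style: a different address for every message.
moving_buffer = [BASE + i * MSG for i in range(ITERATIONS)]
# Cycling through a pool just one entry larger than the cache.
cycling_pool = [BASE + (i % 65) * MSG for i in range(ITERATIONS)]

print(count_registrations(same_buffer, capacity=64))
print(count_registrations(moving_buffer, capacity=64))
print(count_registrations(cycling_pool, capacity=64))
```

In this model the reused buffer registers once, while both moving
patterns register on every single message, so the cache buys nothing
and adds eviction work on top.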
The NetPIPE pvfs module, when run with cache invalidate, effectively
writes sequentially, but from a different buffer every time as well,
and we end up seeing the same behavior, and breakage on the ehca.
By the way, various groups keep rediscovering this problem but there
are no real appealing fixes. When was the last time you saw anybody
use MPI_Alloc_mem? :) We discovered it ourselves in the context of
PVFS back in 2003 or thereabouts, and took a stab at fixing it, but
didn't quite complete the work needed to fully integrate it.
(Wuj's Unifier framework (CCGrid04):
http://www.osc.edu/~pw/papers/wu-unifier-ccgrid04.pdf
)
-- Pete
The solution for a kernel hacker like me is obvious: let the OS
kernel's memory management and the network driver handle the memory
pinning and the interaction with the hardware. This way an application
can just call the OS to register its entire memory space, and the OS
kernel can deal with keeping it all pinned down. If it needs to unpin
something, it can do so.
The catch is that this requires the hardware to support keeping an
address registered but *not* physically pinned, and to *ask nicely*
of the OS, via the page fault handler, to pin the page back down if
something comes in. This seems to be an idea that RDMA hardware
designers just can't wrap their heads around. I guess they are too
used to dealing with OSes that never change.
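
As a rough illustration of the registered-but-not-pinned idea, here is
a toy model. Everything in it is hypothetical: ToyKernel, ToyNic and
their methods are made-up names, pages are plain integers, and no real
kernel or adapter interface is being described.

```python
class ToyKernel:
    """Tracks which pages are registered and which are currently pinned."""

    def __init__(self):
        self.registered = set()
        self.pinned = set()
        self.faults = 0

    def register_range(self, first_page, npages):
        # Registration only records the range; nothing is pinned yet.
        self.registered.update(range(first_page, first_page + npages))

    def unpin(self, page):
        # The kernel may reclaim a page; it stays registered.
        self.pinned.discard(page)

    def handle_nic_fault(self, page):
        # The adapter asks the kernel to pin a registered page back down.
        if page not in self.registered:
            raise PermissionError(f"page {page} was never registered")
        self.faults += 1
        self.pinned.add(page)


class ToyNic:
    """Adapter that faults to the kernel instead of requiring pinning."""

    def __init__(self, kernel):
        self.kernel = kernel

    def dma_write(self, page):
        if page not in self.kernel.pinned:
            self.kernel.handle_nic_fault(page)
        # Report whether the page is pinned when the transfer proceeds.
        return page in self.kernel.pinned


kernel = ToyKernel()
nic = ToyNic(kernel)
kernel.register_range(0, 1024)  # register the whole toy address space
nic.dma_write(7)                # first touch: fault, page gets pinned
nic.dma_write(7)                # already pinned: no fault
kernel.unpin(7)                 # kernel reclaims the page
nic.dma_write(7)                # incoming data: fault, pinned again
print(kernel.faults)
```

The application registers once and never touches registration again;
the cost moves to the occasional fault when the kernel has unpinned a
page that the adapter then needs.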
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers