On Wed, Jun 12, 2013 at 09:18:34PM +0000, Jeff Squyres (jsquyres) wrote:

> > Well, it creates a mess in another sense, because now you've lost
> > context. When your MPI goes to do a 1byte send the kernel may well
> > prefetch a few megabytes of page tables, whereas an implementation in
> > userspace still has the context and can say, no I don't need that..
> 
> It seems like there are Big Problems on either side of this problem
> (userspace and kernel).
> 
> I thought that ummunotify was a good balance between the two -- MPI
> kept its registration caches (which are annoying, but we have
> long-since understood that *someone* has to maintain them), but it
> gets a bulletproof way to keep them coherent.  That is what is
> missing in today's solutions: bulletproofness (plus we have to use
> the horrid glibc malloc hooks, which are deprecated and are going
> away).

Ditto.

Someone has to finish the ummunotify rewrite Roland
started. Realistically MPI is going to be the only user, can someone
from the MPI world do this?
 
> > It doesn't matter if there is no memory mapped to the address space,
> > the address space is still there.
> > 
> > Liran had a good example. You can register address space and then use
> > mmap/munmap/MAP_FIXED to mess around with where it points to
> 
> ...but this is not how people write applications.  Real apps use
> malloc (and some direct mmap, and perhaps even some shared memory).

*shrug* I used MAP_FIXED for some RDMA regions in my IB verbs apps,
specifically to create specialized high-performance memory
structures.

It isn't a general-purpose technique for non-RDMA apps, but
especially when combined with ODP it is useful in some places.

> > A practical example of using this would be to avoid the need to send
> > scatter buffer pointers to the remote. The remote writes into a memory
> > ring and the ring is made 'endless' by clever use of remapping.
> 
> I don't understand -- please explain your example a bit more...?

You have a memory pool.

There are two mappings to this physical memory, one for the CPU to
use, one for RDMA to use.

The RDMA mapping is a linear ring, the remote just spews linearly via
RDMA WRITE.

When messages arrive the CPU translates the RDMA ring virtual address
to the CPU address, and accesses the memory from there.

It then finds a free block in the memory pool, remaps it into the
RDMA ring, and tells the remote that there is more free memory.

From the perspective of the remote this creates an endless, apparently
linear, ring.

When the CPU is done with its memory it adds it back to the free block
pool.

At the start of time the RDMA ring maps 1:1 to the CPU pool.  As xfers
happen the RDMA ring maps non-linearly, depending on when the CPU is
done with the memory.

There are lots of details to make this work, but you avoid sending s/g
lists, and generally make communication more asynchronous.

s/g lists are expensive. A 1GB ring requires nearly 2MB to describe
with s/g lists, and a 40Gb NIC can turn that ring over 4 times per
second!

You can do something similar with sends, but sends have to
pre-size buffers, whereas this scheme lets you send any size message
with optimal memory usage.

Jason