On Wed, Jun 12, 2013 at 09:18:34PM +0000, Jeff Squyres (jsquyres) wrote:
> > Well, it creates a mess in another sense, because now you've lost
> > context. When your MPI goes to do a 1byte send the kernel may well
> > prefetch a few megabytes of page tables, whereas an implementation in
> > userspace still has the context and can say, no I don't need that..
>
> It seems like there are Big Problems on either side of this problem
> (userspace and kernel).
>
> I thought that ummunotify was a good balance between the two -- MPI
> kept its registration caches (which are annoying, but we have
> long-since understood that *someone* has to maintain them), but it
> gets a bulletproof way to keep them coherent. That is what is
> missing in today's solutions: bulletproofness (plus we have to use
> the horrid glibc malloc hooks, which are deprecated and are going
> away).
Ditto. Someone has to finish the ummunotify rewrite Roland started.
Realistically MPI is going to be the only user; can someone from the
MPI world do this?

> > It doesn't matter if there is no memory mapped to the address space,
> > the address space is still there.
> >
> > Liran had a good example. You can register address space and then use
> > mmap/munmap/MAP_FIXED to mess around with where it points to
>
> ...but this is not how people write applications. Real apps use
> malloc (and some direct mmap, and perhaps even some shared memory).

*shrug* I used MAP_FIXED for some RDMA regions in my IB verbs apps,
specifically to create specialized high-performance memory structures.
It isn't a general-purpose technique for non-RDMA apps, but especially
when combined with ODP it is useful in some places.

> > A practical example of using this would be to avoid the need to send
> > scatter buffer pointers to the remote. The remote writes into a memory
> > ring and the ring is made 'endless' by clever use of remapping.
>
> I don't understand -- please explain your example a bit more...?

You have a memory pool. There are two mappings to this physical
memory: one for the CPU to use, one for RDMA to use. The RDMA mapping
is a linear ring; the remote just spews linearly via RDMA WRITE.

When messages arrive the CPU translates the RDMA ring virtual address
to the CPU address and accesses the memory from there. It then finds a
free block in the memory pool, remaps it into the RDMA pool, and tells
the remote that there is more free memory.

From the perspective of the remote this creates an endless, apparently
linear, ring. When the CPU is done with its memory it adds it back to
the free block pool.

At the start of time the RDMA ring maps 1:1 to the CPU pool. As xfers
happen the RDMA ring maps non-linearly, depending on when the CPU is
done with the memory.

There are lots of details to make this work, but you avoid sending s/g
lists and generally make communication more asynchronous.
s/g lists are expensive. A 1GB ring requires nearly 2MB to describe
with s/g lists, and a 40Gb NIC can turn that ring over 4 times per
second!

You can do something similar with sends, but sends have to pre-size
buffers, whereas this scheme lets you send any size message with
optimal memory usage.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
