John A. Gregor wrote:
So, how about this:

Maintain a pool of pre-pinned pages.

When an RTS comes in, use one of the pre-pinned buffers as the place the
DATA will land.  Set up the remaining hw context to enable receipt into
the page(s) and fire back your CTS.

While the CTS is in flight and the DATA is streaming back (and you
therefore have a couple microseconds to play with), remap the virt-to-phys
mapping of the application so that the original virtual address now
points at the pre-pinned page.
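
(For concreteness, the "pool of pre-pinned pages" step of that proposal amounts to something like the sketch below. mlock() stands in for whatever pinning/registration call the hardware actually requires, and the pool size is arbitrary.)

    /* Rough sketch of a pool of pre-pinned pages for incoming DATA.
     * mlock() is a stand-in for the real hardware pinning call. */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define POOL_SIZE 64

    static void *pinned_pool[POOL_SIZE];
    static int   pool_top;

    int pool_init(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);

        for (pool_top = 0; pool_top < POOL_SIZE; pool_top++) {
            void *page;
            if (posix_memalign(&page, pagesize, pagesize) != 0)
                return -1;
            if (mlock(page, pagesize) != 0) {   /* pin: counts against RLIMIT_MEMLOCK */
                free(page);
                return -1;
            }
            pinned_pool[pool_top] = page;
        }
        return 0;
    }

    /* Grab a pre-pinned page for incoming DATA; NULL if the pool is empty. */
    void *pool_get(void)
    {
        return pool_top > 0 ? pinned_pool[--pool_top] : NULL;
    }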

A big part of the performance improvement associated with RDMA comes from avoiding constant page remappings and data copies. If pinning the virtual-to-physical mapping were cheap enough to do for every message, MPI applications could simply pin and register the buffer when sending or receiving each message and then unregister it when the operation completed. Because mapping, unmapping, and remapping memory constantly is too expensive, MPI implementations instead maintain a cache of which memory regions have already been registered.
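
To make that concrete, such a registration cache boils down to something like the sketch below. The hw_register()/hw_deregister() calls are hypothetical placeholders for the real hardware registration calls, and a real cache also has to handle overlapping regions and invalidation:

    /* Minimal sketch of a registration cache.  hw_register() and
     * hw_deregister() are hypothetical stand-ins for the hardware's
     * actual (expensive) pin/register and unregister calls. */
    #include <stddef.h>
    #include <stdlib.h>

    struct reg_entry {
        void             *addr;    /* start of registered region       */
        size_t            len;     /* length of registered region      */
        void             *handle;  /* opaque handle from hw_register() */
        struct reg_entry *next;
    };

    static struct reg_entry *reg_cache;   /* simple linked list */

    extern void *hw_register(void *addr, size_t len);   /* hypothetical */
    extern void  hw_deregister(void *handle);           /* hypothetical */

    /* Return a registration handle covering [addr, addr+len),
     * registering the region only if no cached entry already covers it. */
    void *get_registration(void *addr, size_t len)
    {
        for (struct reg_entry *e = reg_cache; e != NULL; e = e->next) {
            if ((char *)addr >= (char *)e->addr &&
                (char *)addr + len <= (char *)e->addr + e->len)
                return e->handle;              /* cache hit: no re-pin */
        }

        struct reg_entry *e = malloc(sizeof(*e));
        if (e == NULL)
            return NULL;
        e->addr   = addr;
        e->len    = len;
        e->handle = hw_register(addr, len);    /* expensive pin + map  */
        e->next   = reg_cache;
        reg_cache = e;
        return e->handle;
    }

That cache is also exactly why the mapping has to stay stable: if freed memory went back to the OS and later came back at the same virtual address but backed by different physical pages, the cached registrations would silently point at the wrong pages.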

Copying the parts of the page(s) not involved in the transfer, so the rest of the remapped page still holds the application's original data, would also raise overhead quite a bit for smaller RDMAs. With 4K pages it is quite easy to see a 5 or 6K message requiring a 2-3K copy to fix up the rest of a page. And heaven help those systems with huge pages, ~1MB, in such a case.

I have seen this problem with our own MPI application. The 'simple' solution I have seen used in at least one MPI implementation is to prevent the malloc/free implementation from ever returning memory to the OS. The virtual-to-physical mapping can only become invalid if virtual addresses are given back to the OS and later handed back to the process backed by different physical pages. Under Linux, at least, it is quite easy to tell libc never to return memory to the OS: free() and similar functions then simply retain the memory for future malloc() (and similar) calls. Because the memory is never unpinned and never given back to the OS, the virtual-to-physical mapping stays consistent forever. I don't know whether other OSes make this as easy, or even which systems most MPI implementors want their software to run on.
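
With glibc, for what it's worth, that comes down to a couple of mallopt() calls (or the equivalent MALLOC_TRIM_THRESHOLD_ / MALLOC_MMAP_MAX_ environment variables). Take this as a sketch of the knob, not as what any particular MPI implementation ships:

    /* Keep glibc from ever handing memory back to the OS.
     * M_TRIM_THRESHOLD = -1 stops free() from trimming the top of the
     * heap, and M_MMAP_MAX = 0 stops malloc() from using mmap() for
     * large allocations (which would be munmap()ed on free).
     * Both are glibc-specific knobs. */
    #include <malloc.h>

    void keep_memory_forever(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);  /* never shrink the heap via sbrk() */
        mallopt(M_MMAP_MAX, 0);         /* never satisfy malloc() with mmap() */
    }

Setting the environment variables instead gives the same effect when you can't touch the application's code.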

The obvious downside is that a process with highly irregular memory demand will always retain the footprint of its previous peak. And because the memory is pinned, it will not even be swapped out, and it will keep counting against the memory-pinning ulimit. For many MPI applications that is not a problem -- they often have quite fixed memory usage and wouldn't be returning much, if any, memory to the OS anyway. That is the case for our application. I imagine someone out there has a job that doesn't behave so neatly, of course.


Steven Truelove
