John A. Gregor wrote:
So, how about this:

Maintain a pool of pre-pinned pages.

When an RTS comes in, use one of the pre-pinned buffers as the place the
DATA will land.  Set up the remaining hw context to enable receipt into
the page(s) and fire back your CTS.

While the CTS is in flight and the DATA is streaming back (and you
therefore have a couple microseconds to play with), remap the virt-to-phys
mapping of the application so that the original virtual address now
points at the pre-pinned page.
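
(For concreteness, the "pool of pre-pinned pages" step of that proposal amounts to something like the sketch below. mlock() stands in for whatever pinning/registration call the hardware actually requires, and the pool size is arbitrary.)

    /* Rough sketch of a pool of pre-pinned pages for incoming DATA.
     * mlock() is a stand-in for the real hardware pinning call. */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define POOL_SIZE 64

    static void *pinned_pool[POOL_SIZE];
    static int   pool_top;

    int pool_init(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);

        for (pool_top = 0; pool_top < POOL_SIZE; pool_top++) {
            void *page;
            if (posix_memalign(&page, pagesize, pagesize) != 0)
                return -1;
            if (mlock(page, pagesize) != 0) {   /* pin: counts against RLIMIT_MEMLOCK */
                free(page);
                return -1;
            }
            pinned_pool[pool_top] = page;
        }
        return 0;
    }

    /* Grab a pre-pinned page for incoming DATA; NULL if the pool is empty. */
    void *pool_get(void)
    {
        return pool_top > 0 ? pinned_pool[--pool_top] : NULL;
    }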

A big part of the performance improvement associated with RDMA comes from avoiding constant page remappings and data copies. If pinning the virtual-to-physical mapping were cheap enough to do for every message, MPI applications could simply pin and register the buffer when sending or receiving each message and then unregister it when the operation completed. Because mapping, unmapping, and remapping memory constantly is too expensive, MPI implementations instead maintain a cache of which memory regions have already been registered.
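
To make that concrete, such a registration cache boils down to something like the sketch below. The hw_register()/hw_deregister() calls are hypothetical placeholders for the real hardware registration calls, and a real cache also has to handle overlapping regions and invalidation:

    /* Minimal sketch of a registration cache.  hw_register() and
     * hw_deregister() are hypothetical stand-ins for the hardware's
     * actual (expensive) pin/register and unregister calls. */
    #include <stddef.h>
    #include <stdlib.h>

    struct reg_entry {
        void             *addr;    /* start of registered region       */
        size_t            len;     /* length of registered region      */
        void             *handle;  /* opaque handle from hw_register() */
        struct reg_entry *next;
    };

    static struct reg_entry *reg_cache;   /* simple linked list */

    extern void *hw_register(void *addr, size_t len);   /* hypothetical */
    extern void  hw_deregister(void *handle);           /* hypothetical */

    /* Return a registration handle covering [addr, addr+len),
     * registering the region only if no cached entry already covers it. */
    void *get_registration(void *addr, size_t len)
    {
        for (struct reg_entry *e = reg_cache; e != NULL; e = e->next) {
            if ((char *)addr >= (char *)e->addr &&
                (char *)addr + len <= (char *)e->addr + e->len)
                return e->handle;              /* cache hit: no re-pin */
        }

        struct reg_entry *e = malloc(sizeof(*e));
        if (e == NULL)
            return NULL;
        e->addr   = addr;
        e->len    = len;
        e->handle = hw_register(addr, len);    /* expensive pin + map  */
        e->next   = reg_cache;
        reg_cache = e;
        return e->handle;
    }

That cache is also exactly why the mapping has to stay stable: if freed memory went back to the OS and later came back at the same virtual address but backed by different physical pages, the cached registrations would silently point at the wrong pages.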

Copying the parts of the page(s) not involved in the transfer, so the rest of the remapped page still holds the application's original data, would also raise overhead quite a bit for smaller RDMAs. With 4K pages it is quite easy to see a 5 or 6K message requiring a 2-3K copy to fix up the rest of a page. And heaven help those systems with huge pages, ~1MB, in such a case.

I have seen this problem with our own MPI application. The 'simple' solution I have seen used in at least one MPI implementation is to prevent the malloc/free implementation from ever returning memory to the OS. The virtual-to-physical mapping can only become invalid if virtual addresses are given back to the OS and later handed back to the process backed by different physical pages. Under Linux, at least, it is quite easy to tell libc never to return memory to the OS: free() and similar functions then simply retain the memory for future malloc() (and similar) calls. Because the memory is never unpinned and never given back to the OS, the virtual-to-physical mapping stays consistent forever. I don't know whether other OSes make this as easy, or even which systems most MPI implementors want their software to run on.
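
With glibc, for what it's worth, that comes down to a couple of mallopt() calls (or the equivalent MALLOC_TRIM_THRESHOLD_ / MALLOC_MMAP_MAX_ environment variables). Take this as a sketch of the knob, not as what any particular MPI implementation ships:

    /* Keep glibc from ever handing memory back to the OS.
     * M_TRIM_THRESHOLD = -1 stops free() from trimming the top of the
     * heap, and M_MMAP_MAX = 0 stops malloc() from using mmap() for
     * large allocations (which would be munmap()ed on free).
     * Both are glibc-specific knobs. */
    #include <malloc.h>

    void keep_memory_forever(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);  /* never shrink the heap via sbrk() */
        mallopt(M_MMAP_MAX, 0);         /* never satisfy malloc() with mmap() */
    }

Setting the environment variables instead gives the same effect when you can't touch the application's code.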

The obvious downside is that a process with highly irregular memory demand will always retain the footprint of its previous peak. And because the memory is pinned, it will not even be swapped out, and it will keep counting against the memory-pinning ulimit. For many MPI applications that is not a problem -- they often have quite fixed memory usage and wouldn't be returning much, if any, memory to the OS anyway. That is the case for our application. I imagine someone out there has a job that doesn't behave so neatly, of course.


Steven Truelove
