Jeff, had you considered the notion of buffers and buffer iteration introduced by MPI/RT (The Real-Time Message Passing Interface Standard, in Concurrency and Computation: Practice and Experience, Volume 16, No. S1, pp. S1-S332, Dec 2004; see Chapter 5)? It basically sets up a contract of buffer (and underlying memory) ownership between the MPI implementation and the user.

Arkady
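To give a concrete flavor of that ownership contract, here is a rough sketch in C. The names are invented for illustration and are not taken from the MPI/RT spec; the point is just that a buffer is registered (pinned) once, and ownership of it then alternates between the application and the MPI implementation, so the memory never has to be re-pinned per message:

    /* Hypothetical illustration of an MPI/RT-style buffer ownership
     * contract.  All names here are invented for this sketch; they are
     * not the MPI/RT API. */
    #include <stddef.h>

    enum buf_owner { OWNER_USER, OWNER_MPI };

    struct rt_buffer {
        void          *base;   /* pinned and registered once, at setup */
        size_t         len;
        enum buf_owner owner;  /* who may touch the memory right now   */
    };

    /* The user hands the buffer to the library for a transfer...      */
    void rt_buffer_release(struct rt_buffer *b)   { b->owner = OWNER_MPI;  }

    /* ...and reacquires it once the transfer has completed.  Neither
     * handoff touches the page tables or the pin. */
    void rt_buffer_reacquire(struct rt_buffer *b) { b->owner = OWNER_USER; }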
On Thu, Apr 30, 2009 at 8:49 AM, Steven Truelove <[email protected]> wrote:
>
> John A. Gregor wrote:
>>
>> So, how about this:
>>
>> Maintain a pool of pre-pinned pages.
>>
>> When an RTS comes in, use one of the pre-pinned buffers as the place the
>> DATA will land. Set up the remaining hw context to enable receipt into
>> the page(s) and fire back your CTS.
>>
>> While the CTS is in flight and the DATA is streaming back (and you
>> therefore have a couple microseconds to play with), remap the
>> virt-to-phys mapping of the application so that the original virtual
>> address now points at the pre-pinned page.
>
> A big part of the performance improvement associated with RDMA is
> avoiding constant page remappings and data copies. If pinning the
> physical/virtual memory mapping were cheap enough to do for each message,
> MPI applications could simply pin and register the mapping when sending
> or receiving each message and then unmap when the operation was complete.
> MPI implementations maintain a cache of what memory has been registered
> precisely because it is too expensive to map/unmap/remap memory
> constantly.
>
> Copying the parts of the page(s) not involved in the transfer would also
> raise overhead quite a bit for smaller RDMAs. It is quite easy to see a
> 5 or 6K message requiring a 2-3K copy to fix up the rest of a page. And
> heaven help those systems with huge pages, ~1MB, in such a case.
>
> I have seen this problem in our own MPI application. The 'simple'
> solution I have seen used in at least one MPI implementation is to
> prevent the malloc/free implementation being used from ever returning
> memory to the OS. The virtual/physical mapping can only become invalid
> if virtual addresses are given back to the OS and then returned with
> different physical pages. Under Linux, at least, it is quite easy to
> tell libc never to return memory to the OS. In that case free() and
> similar functions simply retain the memory for use by future malloc()
> (and similar) calls. Because the memory is never unpinned and never
> given back to the OS, the physical/virtual mapping is consistent forever.
> I don't know if other OSes make this as easy, or even what systems most
> MPI implementors want their software to run on.
>
> The obvious downside to this is that a process with highly irregular
> memory demand will always have the memory usage of its previous peak.
> And because the memory is pinned, it will not even be swapped out, and
> will count against the memory-pinning ulimit. For many MPI applications
> that is not a problem -- they often have quite fixed memory usage and
> wouldn't be returning much if any memory to the OS anyway. This is the
> case for our application. I imagine someone out there has some job that
> doesn't behave so neatly, of course.
>
> Steven Truelove

--
Cheers,
Arkady Kanevsky
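To make Steven's registration-cache point concrete, here is a minimal sketch of such a cache against the libibverbs API. ibv_reg_mr() and the IBV_ACCESS_* flags are the real verbs calls; everything else is invented for illustration, and a production cache would use an interval tree plus allocator hooks to invalidate entries rather than a linear list:

    /* Minimal sketch of a pin/registration cache: register (and pin) a
     * region on first use, then reuse the registration for any later
     * transfer that falls inside it. */
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct reg_entry {
        void             *addr;
        size_t            len;
        struct ibv_mr    *mr;
        struct reg_entry *next;
    };

    static struct reg_entry *cache;

    struct ibv_mr *reg_cached(struct ibv_pd *pd, void *addr, size_t len)
    {
        struct reg_entry *e;

        /* Hit: the requested range lies inside an already-pinned region. */
        for (e = cache; e; e = e->next)
            if ((char *)addr >= (char *)e->addr &&
                (char *)addr + len <= (char *)e->addr + e->len)
                return e->mr;

        /* Miss: pin and register once, then remember the registration. */
        e = malloc(sizeof(*e));
        if (!e)
            return NULL;
        e->mr = ibv_reg_mr(pd, addr, len,
                           IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE);
        if (!e->mr) {
            free(e);
            return NULL;
        }
        e->addr = addr;
        e->len  = len;
        e->next = cache;
        cache   = e;
        return e->mr;
    }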
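As for the libc knob Steven alludes to: under glibc this is done with mallopt(3). A minimal sketch, assuming glibc (M_TRIM_THRESHOLD and M_MMAP_MAX are real glibc mallopt options; other allocators need different switches):

    #include <malloc.h>  /* glibc-specific: mallopt(3) */

    /* Call early, before heavy allocation.  M_TRIM_THRESHOLD = -1
     * disables heap trimming, so free() never shrinks the break and
     * hands pages back to the kernel.  M_MMAP_MAX = 0 stops glibc from
     * serving large allocations via mmap(), whose pages would otherwise
     * be unmapped on free().  With both set, virtual addresses are
     * never returned to the OS, so a registration that pinned those
     * pages stays valid for the life of the process. */
    static void lock_down_heap(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);
        mallopt(M_MMAP_MAX, 0);
    }

The downside is exactly the one Steven names: the heap never shrinks below its high-water mark, and all of it counts against the pinned-memory ulimit.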
