On Dec 7, 2005, at 9:44 AM, Gleb Natapov wrote:

On Tue, Dec 06, 2005 at 11:07:44AM -0500, Brian Barrett wrote:
On Dec 6, 2005, at 10:53 AM, Gleb Natapov wrote:

On Tue, Dec 06, 2005 at 08:33:32AM -0700, Tim S. Woodall wrote:
Also, memfree hooks decrease cache efficiency; the better solution would be to catch brk() system calls and remove memory from the cache only then, but there is no way to do it for now.

We are looking at other options, including catching brk/munmap system calls, and will be experimenting w/ these on the trunk.

This will be really interesting. How are you going to catch brk/munmap without kernel help? Last time I checked, preload tricks don't work if the syscall is done from inside libc itself.

All of the tricks we are looking at assume that nothing in libc calls
munmap.

glibc does call mmap/munmap internally for big allocations as strace of
this program shows:

#include <stdlib.h>

int main ()
{
        /* big enough that glibc services it with mmap()/munmap() */
        void *p = malloc (1024*1024);
        free (p);
        return 0;
}

Ah, yes, I wasn't clear. On Linux, we actually ship our own version of ptmalloc2 (the allocator used by glibc on Linux). We use the standard linker search order tricks to have the linker choose our versions of malloc, calloc, realloc, valloc, and free, which are from ptmalloc2. We've modified our version of ptmalloc2 such that any time it calls mmap or sbrk with a positive number, it then immediately allows the cache to know about the allocation. Any time it's about to call munmap or sbrk with a negative number, it informs the cache code before giving the memory back to the OS. We also catch mmap and munmap so that we can track when the user calls mmap / munmap. Note that we play with ptmalloc2's code such that it calls our mmap (which either uses the syscall interface directly or calls __mmap depending on what the system supports), so we don't intercept that call to mmap twice or anything like that.
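To make that concrete, here's a minimal sketch of the linker-trick side of it. The hook name cache_notify_release() is purely illustrative (it is not the real rcache interface), and the sketch assumes a Linux box where issuing the raw system call through syscall() is acceptable:

#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical hook into the registration cache -- name made up for
   illustration. */
extern void cache_notify_release(void *addr, size_t len);

/* Our munmap() wins the link-order race against libc's, so user calls
   land here.  Notify the cache first, then issue the real system call
   directly so we don't recurse back into this wrapper. */
int munmap(void *addr, size_t len)
{
    cache_notify_release(addr, len);
    return syscall(SYS_munmap, addr, len);
}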

This works pretty well (like I said - it's worked fine for LAM and MPICH-gm for years), but it requires the user either to use the wrapper compilers or to add -lmpi -lorte -lopal to the link line (i.e., shared library dependencies can't be relied on to pull in libopal.so); otherwise our ptmalloc2 / mmap / munmap isn't used. We can detect that this has happened pretty easily, and then we fall back to the pipelined RDMA code, which doesn't offer the same performance but also doesn't have a pinning problem.
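The detection can be as simple as the sketch below - a trial malloc/free at startup tells us whether our interposed free() actually ran. The names here (intercept_seen, memory_hooks_available) are made up for illustration, not the actual OPAL symbols:

#include <stdlib.h>

/* Our interposed free() wrapper (not shown) sets this to 1. */
volatile int intercept_seen = 0;

/* If the wrapper never ran, the hooks aren't in place and we should
   select the pipelined-RDMA protocol instead of leave-pinned RDMA. */
int memory_hooks_available(void)
{
    void *p = malloc(16);
    free(p);
    return intercept_seen;
}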

We can successfully catch free() calls from inside libc without any problems.  The LAM/MPI team and Myricom (with MPICH-gm) have been doing this for many years.  On the small percentage of MPI applications that require some linker tricks (some of the commercial apps are this way), we won't be able to intercept any free/munmap calls, so we're going to fall back to our RDMA pipeline algorithm.

Yes, but catching free is not good enough. This way we sometimes evict cache entries that may safely remain in the cache. Ideally we should be able to catch events that return memory to the OS (munmap/brk) and remove the memory from the cache only then.

This is essentially what we do on Linux - we only tell the rcache code about allocations / deallocations that actually take memory from, or give memory back to, the operating system.
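As a rough illustration of where those notifications sit, here's a shim around sbrk() of the kind the modified allocator might call; cache_notify_acquire() / cache_notify_release() are the same made-up hook names as in the earlier sketch, not the real rcache API:

#include <stdint.h>
#include <unistd.h>

extern void cache_notify_acquire(void *addr, size_t len);
extern void cache_notify_release(void *addr, size_t len);

/* Called by the modified allocator in place of sbrk(); plain free()
   never comes through here, only the OS-facing path does. */
void *allocator_sbrk(intptr_t delta)
{
    if (delta < 0) {
        /* Heap is about to shrink: tell the cache before the pages
           go back to the OS. */
        void *top = sbrk(0);
        cache_notify_release((char *) top + delta, (size_t) -delta);
    }
    void *old = sbrk(delta);
    if (delta > 0 && old != (void *) -1) {
        /* Heap grew: these addresses are now safe to register. */
        cache_notify_acquire(old, (size_t) delta);
    }
    return old;
}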

On Mac OS X / Darwin, due to its two-level namespace, we can't replace malloc / free with a customized version of the Darwin allocator like we could with ptmalloc2. There are some things you can do to simulate such behavior, but it requires linking in a flat namespace and doing some other things that nearly caused the Darwin engineers to pass out when I was talking to them about said tricks. So instead, we use the Darwin hooks for catching malloc / free / etc. It's not optimal, but it's the best we can do in the situation. And it doesn't force us to link all OMPI applications in a flat namespace, which is always nice. Of course, we still intercept mmap / munmap in the traditional linker-tricks style. But again, there are very few function calls in libSystem.dylib that call mmap that we care about (malloc / free are already taken care of by the standard hooks), so this doesn't cause a problem.
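For the Darwin side, a rough sketch of the zone-hook shape is below - illustrative names only (cache_notify_release() again is made up), glossing over the details of the actual implementation, and assuming the malloc_zone_t function table from <malloc/malloc.h> is writable, as it was on the OS X releases of that era:

#include <malloc/malloc.h>

extern void cache_notify_release(void *addr, size_t len);

static void (*real_zone_free)(malloc_zone_t *zone, void *ptr);

static void hooked_free(malloc_zone_t *zone, void *ptr)
{
    /* Let the cache know about the block before it goes away. */
    cache_notify_release(ptr, malloc_size(ptr));
    real_zone_free(zone, ptr);
}

void install_darwin_free_hook(void)
{
    malloc_zone_t *zone = malloc_default_zone();
    real_zone_free = zone->free;
    zone->free = hooked_free;
}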

Hopefully this made some sense. If not, on to the next round of e-mails :).

Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/

