On Dec 7, 2005, at 9:44 AM, Gleb Natapov wrote:

> On Tue, Dec 06, 2005 at 11:07:44AM -0500, Brian Barrett wrote:
>> On Dec 6, 2005, at 10:53 AM, Gleb Natapov wrote:
>>> On Tue, Dec 06, 2005 at 08:33:32AM -0700, Tim S. Woodall wrote:
>>>>> Also, memfree hooks decrease cache efficiency; the better
>>>>> solution would be to catch brk() system calls and remove the
>>>>> memory from the cache only then, but there is no way to do it
>>>>> for now.
>>>> We are looking at other options, including catching brk/munmap
>>>> system calls, and will be experimenting w/ these on the trunk.
>>> This will be really interesting. How are you going to catch
>>> brk/munmap without kernel help? Last time I checked, preload
>>> tricks don't work if the syscall is done from inside libc itself.
>> All of the tricks we are looking at assume that nothing in libc
>> calls munmap.
> glibc does call mmap/munmap internally for big allocations, as an
> strace of this program shows:
>
> #include <stdlib.h>
>
> int main(void)
> {
>     void *p = malloc(1024 * 1024);
>     free(p);
>     return 0;
> }

Ah, yes, I wasn't clear. On Linux, we actually ship our own version
of ptmalloc2 (the allocator used by glibc on Linux). We use the
standard linker search-order tricks to have the linker choose our
versions of malloc, calloc, realloc, valloc, and free, which come
from ptmalloc2. We've modified our version of ptmalloc2 such that
any time it calls mmap or sbrk with a positive argument, it
immediately lets the cache know about the allocation. Any time it's
about to call munmap or sbrk with a negative argument, it informs
the cache code before giving the memory back to the OS. We also
catch mmap and munmap so that we can track when the user calls
mmap / munmap. Note that we adjust ptmalloc2's code so that it calls
our mmap (which either uses the syscall interface directly or calls
__mmap, depending on what the system supports), so we don't
intercept that call to mmap twice or anything like that.
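
To make the interception of the user's calls concrete, here is a
minimal sketch of the linker-trick idea for munmap (not our actual
code; my_rcache_mem_release() is a made-up stand-in for the
cache-notification call). Build it as a shared object and link it
ahead of libc:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>

static void my_rcache_mem_release(void *addr, size_t length)
{
    /* stub: the real code would evict [addr, addr + length) from
     * the registration cache before the pages disappear */
    (void) addr; (void) length;
}

int munmap(void *addr, size_t length)
{
    static int (*real_munmap)(void *, size_t) = NULL;

    if (NULL == real_munmap) {
        /* look up the next munmap in link order (normally libc's) */
        real_munmap = (int (*)(void *, size_t)) dlsym(RTLD_NEXT,
                                                      "munmap");
    }

    /* inform the cache *before* the memory goes back to the OS */
    my_rcache_mem_release(addr, length);

    return real_munmap(addr, length);
}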

This works pretty well (like I said, it's worked fine for LAM and
MPICH-gm for years), but it has the problem of requiring the user
either to use the wrapper compilers or to add -lmpi -lorte -lopal to
the link line (i.e., shared library dependencies alone can't be used
to load libopal.so); otherwise our ptmalloc2 / mmap / munmap isn't
used. We can detect that this happened pretty easily, and we then
fall back to the pipelined RDMA code, which doesn't offer the same
performance but also doesn't have a pinning problem.

>> We can successfully catch free() calls from inside libc without
>> any problems. The LAM/MPI team and Myricom (with MPICH-gm) have
>> been doing this for many years. On the small percentage of MPI
>> applications that require some linker tricks (some of the
>> commercial apps are this way), we won't be able to intercept any
>> free/munmap calls, so we're going to fall back to our RDMA
>> pipeline algorithm.
> Yes, but catching free is not good enough. This way we sometimes
> evict cache entries that may safely remain in the cache. Ideally we
> should be able to catch the events that return memory to the OS
> (munmap/brk) and remove the memory from the cache only then.

This is essentially what we do on Linux: we only tell the rcache
code about allocations and deallocations when we are getting memory
from, or giving memory back to, the operating system.
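
For instance, the sbrk path in our modified allocator boils down to
something like the following sketch (the two cache callbacks are
made-up names, not our actual API):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

static void my_rcache_mem_acquire(void *addr, size_t length)
{
    /* stub: the real code would make the new memory known to the
     * registration cache */
    (void) addr; (void) length;
}

static void my_rcache_mem_release(void *addr, size_t length)
{
    /* stub: the real code would evict the range from the cache */
    (void) addr; (void) length;
}

static void *my_sbrk(intptr_t increment)
{
    if (increment < 0) {
        /* shrinking the heap: evict the doomed range *before* it
         * is returned to the OS */
        char *brk_now = (char *) sbrk(0);
        my_rcache_mem_release(brk_now + increment,
                              (size_t) -increment);
    }

    void *ret = sbrk(increment);

    if (increment > 0 && (void *) -1 != ret) {
        /* growing the heap: let the cache know about the new
         * memory */
        my_rcache_mem_acquire(ret, (size_t) increment);
    }

    return ret;
}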

On Mac OS X / Darwin, due to its two-level namespace, we can't
replace malloc / free with a customized version of the Darwin
allocator like we could with ptmalloc2. There are some things you
can do to simulate such behavior, but it requires linking in a flat
namespace and doing some other things that nearly caused the Darwin
engineers to pass out when I was talking to them about said tricks.
So instead, we use the Darwin hooks for catching malloc / free /
etc. It's not optimal, but it's the best we can do in the
situation, and it doesn't force us to link all OMPI applications in
a flat namespace, which is always nice. Of course, we still
intercept mmap / munmap in the traditional linker-tricks style. But
again, there are very few function calls in libSystem.dylib that
call mmap that we care about (malloc / free are already taken care
of by the standard hooks), so this doesn't cause a problem.
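
One way such a hook can be installed is by patching the default
malloc zone; here's a rough sketch, assuming the zone structure is
writable, as it was on the Mac OS X of this era (the cache callback
is again a made-up name, and this is illustration, not our actual
code):

#include <malloc/malloc.h>
#include <stddef.h>

static void my_rcache_mem_release(void *addr, size_t length)
{
    /* stub: evict the range from the registration cache */
    (void) addr; (void) length;
}

static void (*real_zone_free)(malloc_zone_t *, void *) = NULL;

static void hooked_free(malloc_zone_t *zone, void *ptr)
{
    /* tell the cache before the allocator reclaims the block */
    my_rcache_mem_release(ptr, malloc_size(ptr));
    real_zone_free(zone, ptr);
}

static void install_free_hook(void)
{
    malloc_zone_t *zone = malloc_default_zone();
    real_zone_free = zone->free;   /* remember the original */
    zone->free = hooked_free;      /* patch in our version */
}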

Hopefully this made some sense. If not, on to the next round of
e-mails :).
Brian
--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/