On 23.08.2006 [23:11:14 -0700], Nishanth Aravamudan wrote:
> On 22.08.2006 [19:08:40 -0700], Nishanth Aravamudan wrote:
> > Hi,
> >
> > Here is my attempt at reinstating the mlocking guarantee for
> > morecore. The issue previously was that we would fault in hugepages
> > on the current node only, leading to terrible NUMA performance.
> > Instead, we now check the current mempolicy and if it's DEFAULT
> > (which acc'g to the man mbind page means "Unless the process policy
> > has been changed this means to allocate memory on the node of the
> > CPU that triggered the allocation.") we change it to INTERLEAVE. I
> > think we want to respect the policy if it's BIND or PREFERRED,
> > although maybe only the latter is really important.
> >
> > The NUMA API man-pages are really bad, so I'll probably spend some
> > time now creating patches for them, based upon my reading of the
> > corresponding kernel code.
> >
> > Unfortunately, this would introduce a dependency on libnuma, as
> > otherwise the get_mempolicy() and mbind() calls have no definition
> > :( So I'm emulating them with indirect syscalls.
> >
> > I'm going to go and test this now on a non-NUMA machine until I can
> > find access to a larger NUMA machine where this might make a
> > difference, but wanted to get the patch out there, because I'm not
> > entirely sure I know what I'm doing :)
> >
> > Completely only an RFC right now, not requesting inclusion, so not
> > Signed-off.
>
> Second try, compile-tested and run-tested on a non-NUMA machine (passes
> make func). Will hopefully have time to test on a NUMA box tomorrow.
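For anyone following along, the policy check described in the quoted
message (query the current mempolicy, switch to INTERLEAVE only if it is
DEFAULT, go through syscall() rather than libnuma) might look roughly
like the sketch below. This is only an illustration under assumptions:
the sketch_ names are hypothetical and not taken from any version of the
patch, the two mode values are copied from the kernel's mempolicy ABI
rather than pulled in from numaif.h, and error handling is reduced to
returning -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Mode values from the kernel mempolicy ABI, named locally (hypothetical
 * names) so that the sketch does not need numaif.h from libnuma.
 */
enum { SKETCH_MPOL_DEFAULT = 0, SKETCH_MPOL_INTERLEAVE = 3 };

/* Returns 1 if the calling thread's policy is DEFAULT, 0 if not, -1 on error. */
static int sketch_policy_is_default(void)
{
	int mode = -1;

	/* NULL nodemask and NULL addr: only the thread's policy mode is fetched */
	if (syscall(SYS_get_mempolicy, &mode, NULL, 0UL, NULL, 0UL) < 0)
		return -1;
	return mode == SKETCH_MPOL_DEFAULT;
}

/*
 * Interleave [addr, addr+len) across the nodes in nodemask, but only when
 * the current policy is DEFAULT, so BIND and PREFERRED are left alone.
 * Returns 1 if the range was rebound, 0 if the existing policy was kept,
 * -1 on error (errno is left as set by the failing syscall).
 */
static int sketch_interleave_if_default(void *addr, unsigned long len,
					const unsigned long *nodemask,
					unsigned long maxnode)
{
	int is_default;

	assert(addr != NULL);

	is_default = sketch_policy_is_default();
	if (is_default < 0)
		return -1;
	if (!is_default)
		return 0;
	if (syscall(SYS_mbind, addr, len, (long)SKETCH_MPOL_INTERLEAVE,
		    nodemask, maxnode, 0UL) < 0)
		return -1;
	return 1;
}
```

Note that mbind() wants a page-aligned address, so a caller would hand
it the region returned by mmap(), as morecore does. On a kernel without
NUMA support, or in a sandbox that filters these syscalls, both helpers
simply return -1.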
Reinstate the mlock() behavior in morecore to guarantee hugepages to the
process, if they are available at malloc time. Rather than use libnuma
to do this, which would add an extra dependency to the library, I
recoded the one piece of functionality we needed from it, a means to
count the number of nodes in the system, by parsing
/sys/devices/system/node/*.

Compile-tested on ppc, ppc64, x86 and x86_64. Run-tested on a ppc64 box
with 8 memory nodes: got interleaving when HUGETLB_OVERRIDE_NUMA_POLICY
was set, and no interleaving when it was unset (or 0), as expected.

This does depend on a kernel change, however, which is currently pending
in -mm
(http://marc.theaimsgroup.com/?l=linux-mm-commits&m=115704697415382&w=2).

This is not yet requested for inclusion, as we are still waiting to hear
from folks on an acceptable default policy (if INTERLEAVE is not a good
one).

Signed-off-by: Nishanth Aravamudan <[EMAIL PROTECTED]>

---

Third try. Works as expected. I'm finally ready to sign-off.

 Makefile   |    2 -
 morecore.c |   56 ++++++++++++++++++++++++++++++++++++++++++++---------
 numa.c     |   64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 numa.h     |   46 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 158 insertions(+), 10 deletions(-)

diff --git a/Makefile b/Makefile
index 998f74d..a5734b1 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 PREFIX = /usr/local
 
-LIBOBJS = hugeutils.o elflink.o morecore.o debug.o
+LIBOBJS = hugeutils.o elflink.o morecore.o debug.o numa.o
 SBINOBJS = hugetlbd
 INSTALL_OBJ_LIBS = libhugetlbfs.so libhugetlbfs.a
 LDSCRIPT_TYPES = B BDT
diff --git a/morecore.c b/morecore.c
index 9f13316..33e9ba9 100644
--- a/morecore.c
+++ b/morecore.c
@@ -26,6 +26,9 @@
 #include <sys/mman.h>
 #include <errno.h>
 #include <dlfcn.h>
+#include <string.h>
+
+#include "numa.h"
 
 #include "hugetlbfs.h"
 
@@ -37,6 +40,7 @@ static long blocksize;
 static void *heapbase;
 static void *heaptop;
 static long mapsize;
+static int override_policy;
 
 /*
  * Our plan is to ask for pages 'roughly' at the BASE. We expect and
@@ -49,10 +53,42 @@ static long mapsize;
  * go back to small pages and use mmap to get them. Hurrah.
  */
 
+static int guarantee_memory(void *p, long size)
+{
+	int ret;
+
+	/*
+	 * Don't override the NUMA policy unless told to by the
+	 * environment.
+	 *
+	 * Default to interleaving at fault-time to avoid having all the
+	 * hugepages being allocated on the current node.
+	 */
+	if (numa_is_available && override_policy) {
+		if (syscall(__NR_mbind, p, size, MPOL_INTERLEAVE,
+				nodemask, NUMA_NUM_NODES+1, 0) < 0) {
+			WARNING("mbind() failed: %s\n", strerror(errno));
+			return -1;
+		}
+	} else {
+		DEBUG("NUMA unavailable or process policy not overridden\n");
+	}
+
+	ret = mlock(p, size);
+	if (ret < 0) {
+		WARNING("mlock() failed: %s\n", strerror(errno));
+		return -1;
+	}
+	munlock(p, size);
+
+	return 0;
+}
+
 static void *hugetlbfs_morecore(ptrdiff_t increment)
 {
 	void *p;
 	long newsize = 0;
+	int ret;
 
 	DEBUG("hugetlbfs_morecore(%ld) = ...\n", (long)increment);
 
@@ -86,20 +122,13 @@ static void *hugetlbfs_morecore(ptrdiff_
 			return NULL;
 		}
 
-#if 0
-/* Use of mlock is disabled because it results in bad numa behavior since
- * the malloc'd memory is allocated node-local to the cpu calling morecore()
- * and not to the cpu(s) that are actually using the memory.
- */
-		/* Use mlock to guarantee these pages to the process */
-		ret = mlock(p, newsize);
+		/* Use mbind and mlock to guarantee these pages to the process */
+		ret = guarantee_memory(p, newsize);
 		if (ret) {
 			WARNING("Failed to reserve huge pages in hugetlbfs_morecore()\n");
 			munmap(p, newsize);
 			return NULL;
 		}
-		munlock(p, newsize);
-#endif
 
 		mapsize += newsize;
 	}
@@ -120,6 +149,14 @@ static void __attribute__((constructor))
 	if (! env)
 		return;
 
+	/*
+	 * 0 = inherit NUMA policy of the underlying process (default)
+	 * 1 = override NUMA policy and interleave on all nodes
+	 */
+	env = getenv("HUGETLB_OVERRIDE_NUMA_POLICY");
+	if (env)
+		override_policy = atoi(env);
+
 	blocksize = gethugepagesize();
 	if (! blocksize) {
 		ERROR("Hugepages unavailable\n");
@@ -147,6 +184,7 @@ static void __attribute__((constructor))
 
 	DEBUG("setup_morecore(): heapaddr = 0x%lx\n", heapaddr);
 
+	setup_numa_if_available();
 	heaptop = heapbase = (void *)heapaddr;
 	__morecore = &hugetlbfs_morecore;
 
diff --git a/numa.c b/numa.c
new file mode 100644
index 0000000..2794ad9
--- /dev/null
+++ b/numa.c
@@ -0,0 +1,64 @@
+/*
+ * libhugetlbfs - Easy use of Linux hugepages
+ * Copyright (C) 2006 Nishanth Aravamudan
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <limits.h>
+
+#include "numa.h"
+
+#include "libhugetlbfs_internal.h"
+
+void setup_numa_if_available(void)
+{
+	int i, maxnode = -1;
+	char nodepath[PATH_MAX+1];
+	//char sysfs_mount[PATH_MAX+1];
+
+	/*
+	 * assume sysfs is mounted at /sys
+	 * eventually should replace with a find_mountpoint() helper
+	 */
+	if (access("/sys/devices/system/node", F_OK) != 0) {
+		WARNING("Could not find node sysfs directory: %s (%s)\n",
+			"/sys/devices/system/node", strerror(errno));
+		return;
+	}
+
+	numa_is_available = 1;
+
+	while (1) {
+		snprintf(nodepath, PATH_MAX+1,
+			"/sys/devices/system/node/node%d", maxnode+1);
+		if (access(nodepath, F_OK) == 0)
+			++maxnode;
+		else
+			break;
+	}
+
+	DEBUG("maxnode = %d, NUMA_NUM_NODES = %d\n", maxnode,
+		NUMA_NUM_NODES);
+	for (i = 0; i <= maxnode; i++)
+		nodemask[i / BITS_PER_LONG] |= (1UL<<(i%BITS_PER_LONG));
+}
diff --git a/numa.h b/numa.h
new file mode 100644
index 0000000..f5068ee
--- /dev/null
+++ b/numa.h
@@ -0,0 +1,46 @@
+/*
+ * libhugetlbfs - Easy use of Linux hugepages
+ * Copyright (C) 2006 Nishanth Aravamudan
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#define _GNU_SOURCE
+
+#include <numaif.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include "libhugetlbfs_internal.h"
+
+#if defined(__x86_64__) || defined(__i386__)
+#define NUMA_NUM_NODES 128
+#elif defined(__powerpc64__) || defined(__powerpc__)
+#define NUMA_NUM_NODES 32
+#else
+#define NUMA_NUM_NODES 2048
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (8*sizeof(unsigned long))
+#endif
+
+/* adapted from libnuma source */
+int numa_is_available;
+unsigned long nodemask[NUMA_NUM_NODES/BITS_PER_LONG];
+
+int numa_max_node(void);
+void setup_numa_if_available(void);

-- 
Nishanth Aravamudan <[EMAIL PROTECTED]>
IBM Linux Technology Center

_______________________________________________
Libhugetlbfs-devel mailing list
Libhugetlbfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libhugetlbfs-devel