Thank you to you both. I modified the allocator to allocate one large block using hwloc_alloc and then use one thread per NUMA domain to touch each page according to the tiling pattern. Unfortunately, I hadn't appreciated that hwloc_get_area_membind_nodeset now always returns the full machine NUMA mask, and not the NUMA domain that touched the page (I guess it only gives the expected answer when set_area_membind has been used first).
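For reference, here is a minimal sketch of the scheme described above: one block from hwloc_alloc(), one touching thread bound to each NUMA domain, pages first-touched tile by tile. The matrix size, tile size, page size and the round-robin tile-to-node mapping are illustrative assumptions, not details taken from the original allocator; it assumes hwloc >= 1.11 naming (HWLOC_OBJ_NUMANODE).

/* Sketch: allocate one large block with hwloc_alloc(), then let one thread
 * per NUMA domain first-touch the pages of the tiles assigned to it.
 * Sizes and the round-robin tile->node mapping are assumptions. */
#include <hwloc.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define N     8192               /* assumed: N x N matrix of doubles      */
#define TILE  512                /* assumed: square TILE x TILE tiles     */
#define NT    (N / TILE)         /* tiles per row/column                  */
#define PAGE  4096               /* assumed page size                     */

static hwloc_topology_t topo;
static double *matrix;
static int nnodes;

/* Assumed mapping: tiles handed out round-robin in row-major order. */
static int tile_owner(int ti, int tj) { return (ti * NT + tj) % nnodes; }

static void *touch_thread(void *arg)
{
    int node = (int)(intptr_t)arg;
    hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node);

    /* Run on cores local to this NUMA domain so that first-touch places
     * the pages written below on this domain. */
    hwloc_set_cpubind(topo, obj->cpuset, HWLOC_CPUBIND_THREAD);

    for (int ti = 0; ti < NT; ti++)
        for (int tj = 0; tj < NT; tj++) {
            if (tile_owner(ti, tj) != node)
                continue;
            for (int r = 0; r < TILE; r++) {   /* touch one byte per page */
                char *row = (char *)&matrix[(size_t)(ti * TILE + r) * N
                                            + (size_t)tj * TILE];
                for (size_t off = 0; off < TILE * sizeof(double); off += PAGE)
                    row[off] = 0;
            }
        }
    return NULL;
}

int main(void)
{
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    nnodes = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);

    size_t bytes = (size_t)N * N * sizeof(double);
    matrix = hwloc_alloc(topo, bytes);  /* unbound: placement decided by first touch */

    pthread_t *tid = malloc(nnodes * sizeof(*tid));
    for (int n = 0; n < nnodes; n++)
        pthread_create(&tid[n], NULL, touch_thread, (void *)(intptr_t)n);
    for (int n = 0; n < nnodes; n++)
        pthread_join(tid[n], NULL);

    /* ... schedule each tile's work on tile_owner(ti, tj) ... */

    free(tid);
    hwloc_free(topo, matrix, bytes);
    hwloc_topology_destroy(topo);
    return 0;
}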
I had hoped to use a dynamic query of the pages (using the first page of a given tile) to schedule each task that operates on a given tile on the NUMA node that touched it. I can work around this by using a matrix offset calculation to get the NUMA node, but if there is a way of querying the page directly, please let me know.

Thanks
JB

________________________________________
From: hwloc-users [hwloc-users-boun...@lists.open-mpi.org] on behalf of Samuel Thibault [samuel.thiba...@inria.fr]
Sent: 12 November 2017 10:48
To: Hardware locality user list
Subject: Re: [hwloc-users] question about hwloc_set_area_membind_nodeset

Brice Goglin, on Sun, 12 Nov 2017 05:19:37 +0100, wrote:
> That's likely what's happening. Each set_area() may be creating a new
> "virtual memory area". The kernel tries to merge them with neighbors if
> they go to the same NUMA node. Otherwise it creates a new VMA.

Mmmm, that sucks. Ideally we'd have a way to ask the kernel not to
strictly bind the memory, but just to allocate it on a given memory node
and hope that the allocation will not go away (e.g. due to swapping),
which thus wouldn't need a VMA to record the information. As you describe
below, first-touch achieves that, but it's not necessarily so convenient.

> I can't find the exact limit but it's something like 64k, so I guess
> you're exhausting that.

It's the sysctl vm.max_map_count.

> Question 2: Is there a better way of achieving the result I'm looking for
> (such as a call to membind with a stride of some kind to say "put N pages
> in a row on each domain in alternation")?
>
> Unfortunately, the interleave policy doesn't have a stride argument. It's
> one page on node 0, one page on node 1, etc.
>
> The only idea I have is to use the first-touch policy: make sure your
> buffer isn't in physical memory yet, and have a thread on node 0 read the
> "0" pages, and another thread on node 1 read the "1" pages.

Or "next-touch", if that were ever to get merged into mainline Linux :)

Samuel
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
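Regarding the offset workaround and the "query the page directly" question above, here is a rough sketch, not the original code. The first helper derives the owning node from tile coordinates and assumes the same round-robin pattern as in the touching sketch earlier; the second assumes a sufficiently recent hwloc (1.11.3 or later), where hwloc_get_area_memlocation() reports where already-faulted pages actually reside, as opposed to the binding policy that hwloc_get_area_membind_nodeset() returns.

/* Sketch only; names and the tile->node pattern are assumptions. */
#include <hwloc.h>

/* Workaround: derive the owning NUMA node from tile coordinates. This must
 * match whatever pattern the touching threads used; row-major round-robin
 * here is an illustrative assumption. */
static int tile_owner(int ti, int tj, int ntiles_per_row, int nnodes)
{
    return (ti * ntiles_per_row + tj) % nnodes;
}

/* Direct query (assumes hwloc >= 1.11.3): ask where already-touched pages
 * physically reside.  Unlike the membind queries, this reports actual
 * placement, so it works with first-touch rather than set_area_membind. */
static int node_of_page(hwloc_topology_t topo, const void *addr, size_t len)
{
    hwloc_bitmap_t nodeset = hwloc_bitmap_alloc();
    int node = -1;
    if (hwloc_get_area_memlocation(topo, addr, len, nodeset,
                                   HWLOC_MEMBIND_BYNODESET) == 0)
        node = hwloc_bitmap_first(nodeset);  /* -1 if nothing is resident yet */
    hwloc_bitmap_free(nodeset);
    return node;
}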