Is there any plan to support NUMA memory binding for tasks?
Even with bind-to-core and memory affinity in 1.4.3, we were seeing 15-20% variation in run times on a Nehalem cluster. This turned out to be mostly due to bad page placement. Residual page cache pages from the last job on a node (or the memory of a suspended job, in the case of preemption) could occasionally cause a lot of non-local page placement.

We hacked the libnuma module to MPOL_BIND tasks to their local memory, which eliminated the majority of this variability. We are currently running with this as the default behaviour, since it's "the right thing" for 99% of jobs (we have an environment variable to back off to affinity for the rest).

I'm guessing/hoping that doing the above based on hwloc will be easier and more maintainable. As a first pass, when is that likely to be an option?

David
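
For readers unfamiliar with strict binding, the idea can be sketched roughly as below. This is a minimal, Linux-only illustration and not the actual libnuma module change described above: the helper names (build_nodemask, want_strict_bind, bind_memory_to_node), the MASK_WORDS size, and the "0 disables strict binding" convention are all assumptions made for the example. It calls the set_mempolicy system call directly so that it builds without the libnuma headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* MPOL_BIND as defined in <linux/mempolicy.h>; repeated here so the
 * sketch does not depend on libnuma's <numaif.h> being installed. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif

/* Assumed upper bound on NUMA nodes for this sketch (16 words of bits). */
#define MASK_WORDS 16
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: clear the mask, then set the bit for `node`.
 * Returns 0 on success, -1 if `node` is out of range. */
int build_nodemask(int node, unsigned long mask[MASK_WORDS])
{
    assert(mask != NULL);
    if (node < 0 || (size_t)node >= MASK_WORDS * BITS_PER_WORD)
        return -1;
    memset(mask, 0, MASK_WORDS * sizeof(unsigned long));
    mask[(size_t)node / BITS_PER_WORD] |= 1UL << ((size_t)node % BITS_PER_WORD);
    return 0;
}

/* Hypothetical opt-out: the caller passes the value of whatever
 * environment variable the site uses (e.g. from getenv()).
 * Strict binding is the default; the value "0" turns it off. */
int want_strict_bind(const char *env_value)
{
    return !(env_value != NULL && strcmp(env_value, "0") == 0);
}

/* Restrict all future page allocations of the calling thread to `node`.
 * Returns 0 on success, -1 with errno set on failure. */
int bind_memory_to_node(int node)
{
    unsigned long mask[MASK_WORDS];

    if (build_nodemask(node, mask) != 0) {
        errno = EINVAL;
        return -1;
    }
#ifdef SYS_set_mempolicy
    /* The kernel treats maxnode as "number of bits + 1". */
    if (syscall(SYS_set_mempolicy, MPOL_BIND, mask,
                (unsigned long)(MASK_WORDS * BITS_PER_WORD + 1)) != 0)
        return -1;
    return 0;
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

A task would call bind_memory_to_node() with its local node right after being bound to its core and before making any large allocations, skipping the call when want_strict_bind() returns 0. With MPOL_BIND the kernel will not fall back to a remote node when the local one is full, so a job that needs more memory than its local node provides can fail to allocate rather than spill over, which is the reason for keeping an opt-out.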