On 12/14/2010 01:29 AM, Jeff Squyres wrote:
On Dec 10, 2010, at 4:56 PM, David Singleton wrote:
Is there any plan to support NUMA memory binding for tasks?
Yes.
For some details on what we're planning for affinity, see the BOF slides that I presented
at SC'10 on the OMPI web site (under "publications").
I didn't see memory binding in there explicitly.
Even with bind-to-core and memory affinity in 1.4.3 we were seeing 15-20%
variation in run times on a Nehalem cluster. This turned out to be mostly due
to bad page placement. Residual pagecache pages from the last job on a node (or
the memory of a suspended job in the case of preemption) could occasionally
cause a lot of non-local page placement. We hacked the libnuma module to
MPOL_BIND tasks to their local memory and eliminated the majority of this
variability. We are currently running with this as the default behaviour since
it's "the right thing" for 99% of jobs (we have an environment variable to back
off to affinity for the rest).
What OS and libnuma version are you running? It has been my experience that libnuma can
lie on RHEL 5 and earlier. My (possibly flawed) understanding is that this is because of
lack of proper kernel support; such "proper" kernel support was only added
fairly recently (2.6.30something).
That's interesting. By "lie", do you mean processes are not really memory
bound?
We're running 2.6.27.55 (and numactl 0.9.8-11.el5) and I've done quite a bit of
testing that always looks correct.
That aside, it's somewhat disappointing that MPOL_PREFERRED is not working well
and that you had to switch to MPOL_BIND. :-(
I'm not sure it's disappointing - I think it's just to be expected. For sites
that drop caches, run a whole-node memhog or reboot nodes between jobs,
MPOL_PREFERRED will do the right thing. For sites that are not so careful, or
that use suspend/resume scheduling, memory overcommit and some amount of page
reclaim or paging at job startup will happen occasionally. Paying the extra
cost of making sure that page reclaim or paging results in ideal locality is
definitely a big win for a job overall. (Paging suspended jobs back in after
they are resumed can undo some of their ideal placement, but that can be
handled.)
Should we add an MCA parameter to switch between BIND and PREFERRED, and
perhaps default to BIND?
I'm not sure BIND should be the default for everyone - memory-imbalanced jobs
might page badly in that case. But, yes, we would like an MCA parameter to
choose, and to allow sites to select BIND as their default if they wish. An
mpirun option like --bind-to-mem would need a preferred/affinity alternative,
and I'm not sure of a nice notation/syntax for that.
Cheers,
David