I apologize for coming to this late - IU's email forwarding jammed up
yesterday, so I'm only now getting around to reading the backlog.
There has been some off-list discussion about advanced paffinity
mappings/bindings. As I noted there, I am in the latter stages of
completing a new mapper that allows users to easily specify #cpus to
"bind" to each process.
As part of that effort, we have interfaced to the slurm cpus_per_task
and cpuset envars. So we should (once this gets done) pretty much
handle the slurm environment in that regard.
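For the curious, the probing involved is just reading SLURM's environment; a minimal sketch (it only assumes the standard SLURM_CPUS_PER_TASK variable -- the exact cpuset-related variable names are whatever SLURM exports, so check the docs):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* SLURM exports the --cpus-per-task value to each task; fall back
         * to 1 if it is not set (not under SLURM, or no value given). */
        const char *cpt = getenv("SLURM_CPUS_PER_TASK");
        int cpus_per_task = (cpt != NULL) ? atoi(cpt) : 1;

        printf("cpus per task: %d\n", cpus_per_task);
        return 0;
    }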
Having worked on the paffinity issue for some time, I am fairly strongly
of the opinion that PLPA is doing exactly what it should do. It
is up to OMPI/ORTE to identify what cpusets were assigned to the job
and figure out the mappings - the PLPA is there to tell us how many
processors are available, how many are in each socket, etc., and to do
the mechanics of binding the specified process to the specified cpus.
I would tend to oppose any change in the relative responsibilities of
OMPI/ORTE and PLPA. It is a good division as it stands, and is working
well. I haven't read anything in this chain that would change my
opinion.
Just my $0.0002
Ralph
On Jul 22, 2009, at 11:22 AM, Jeff Squyres wrote:
On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote:
I'm interested in joining the effort, since we will likely have the same
problem with SLURM's cpuset support.
Ok.
> But as to why it's getting EINVAL, that could be wonky. We might want to
> take this to the PLPA list and have you run some small, non-MPI examples
> to ensure that PLPA is parsing your /sys tree properly, etc.
I don't see the /sys implication here. Can you be more precise on which
files are read to determine placement?
Check in opal/mca/paffinity/linux/plpa/src/libplpa/plpa_map.c:load_cache().
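The kernel exposes the topology under /sys/devices/system/cpu/cpuN/topology/
(physical_package_id, core_id), which is the sort of thing load_cache()
parses. A hand-rolled sketch of reading it directly (not PLPA code, just to
show where the data lives):

    #include <stdio.h>

    int main(void)
    {
        /* Read the socket and core ids for cpu0 straight from sysfs. */
        int socket = -1, core = -1;
        FILE *f;

        f = fopen("/sys/devices/system/cpu/cpu0/topology/physical_package_id", "r");
        if (f) { fscanf(f, "%d", &socket); fclose(f); }

        f = fopen("/sys/devices/system/cpu/cpu0/topology/core_id", "r");
        if (f) { fscanf(f, "%d", &core); fclose(f); }

        printf("cpu0: socket %d, core %d\n", socket, core);
        return 0;
    }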
IIRC, when you are inside a cpuset, you can see all cpus (/sys should be
unmodified) but calling sched_setaffinity with a mask containing a cpu
outside the cpuset will return EINVAL.
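A standalone way to see that (a sketch, not OMPI code): bind to a single
specific cpu id; if that cpu is not one the cpuset permits, the call fails
with EINVAL -- which is essentially what happens when a rank is assigned
"cpu i" without consulting the cpuset:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        /* cpu id to bind to comes from the command line, e.g. ./a.out 7 */
        int cpu = (argc > 1) ? atoi(argv[1]) : 0;
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* Expected to fail with EINVAL if 'cpu' lies outside the cpuset
         * this process is confined to. */
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
            printf("bind to cpu %d failed: %s\n", cpu, strerror(errno));
        else
            printf("bound to cpu %d\n", cpu);
        return 0;
    }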
Ah, that could be the issue.
The only solution I see would be to get the "allowed" cpus with
sched_getaffinity, which should be set according to the cpuset mask.
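A minimal sketch of that approach (plain Linux API, not OMPI code) -- the
mask sched_getaffinity returns reflects the cpuset, so it tells us which
cpus we may actually use:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        int i;

        /* The kernel reports only the cpus the cpuset permits for us. */
        if (sched_getaffinity(0, sizeof(mask), &mask) < 0) {
            perror("sched_getaffinity");
            return 1;
        }

        printf("allowed cpus:");
        for (i = 0; i < CPU_SETSIZE; ++i)
            if (CPU_ISSET(i, &mask))
                printf(" %d", i);
        printf("\n");
        return 0;
    }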
There are two issues here:
- what should OMPI do
- what should PLPA do
PLPA currently does two things:
1. provide a portable set/get affinity API (to isolate you from
whatever version you have in your linux install)
2. provide topology mapping information (sockets, cores)
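A rough illustration of those two roles (the exact PLPA function names and
signatures here are from memory -- check plpa.h before trusting them):

    #include <stdio.h>
    #include <plpa.h>

    int main(void)
    {
        plpa_cpu_set_t mask;
        int socket = -1, core = -1;

        /* Role 1: portable set/get affinity -- PLPA hides which glibc
         * sched_setaffinity() variant is installed.  Bind to processor 0. */
        PLPA_CPU_ZERO(&mask);
        PLPA_CPU_SET(0, &mask);
        if (plpa_sched_setaffinity(0, sizeof(mask), &mask) != 0)
            fprintf(stderr, "plpa_sched_setaffinity failed\n");

        /* Role 2: topology mapping -- which socket/core is processor 0? */
        if (plpa_map_to_socket_core(0, &socket, &core) == 0)
            printf("processor 0 -> socket %d, core %d\n", socket, core);

        return 0;
    }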
PLPA does not currently deal with cpusets. If we want to expand
PLPA to somehow interact with cpusets, that should probably be
brought up on the PLPA mailing lists (someone made this suggestion
to me about a month or two ago and I haven't had a chance to follow
up on it :-( ).
OMPI (as a whole -- meaning: including the ORTE layer) does the
following:
1. decide whether to bind MPI processes or not
2. if we do bind, use the paffinity module to bind processes to
specific processors (the linux paffinity module uses PLPA to do the
actual binding -- PLPA is wholly embedded inside OMPI's linux
paffinity module)
And there are two layers involved here:
- the main ORTE logic, which both says "yes, bind" and decides which
processors to bind to
- the linux paffinity component, which does a thin layer of translation
between ORTE's/OMPI's requests and calls into the back-end PLPA library
As Ralph described, OMPI is currently fairly "dumb" about how it
chooses which processors it uses -- 0 to N-1. I think the issue
here is to make OMPI smarter about how it chooses which processors
to use. It could be in ORTE itself, or it could be in the linux
paffinity translation layer (e.g., linux paffinity component could
report only as many processors as are available in the cpuset...?
And binding could be relative to the cpuset...?).
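One possible shape for that cpuset-relative numbering (just a sketch of the
idea, not ORTE/paffinity code): enumerate the cpus that sched_getaffinity
reports as allowed and treat the i-th allowed cpu as "logical processor i":

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Translate a logical index (0..n_allowed-1) into a physical cpu id,
     * counting only cpus the cpuset permits.  Returns -1 if out of range. */
    static int logical_to_physical(int logical)
    {
        cpu_set_t mask;
        int cpu, seen = 0;

        if (sched_getaffinity(0, sizeof(mask), &mask) < 0)
            return -1;

        for (cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
            if (CPU_ISSET(cpu, &mask)) {
                if (seen == logical)
                    return cpu;
                ++seen;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* e.g. with a cpuset of {4,5,6,7}, logical 0 maps to physical 4 */
        printf("logical 0 -> physical cpu %d\n", logical_to_physical(0));
        return 0;
    }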
--
Jeff Squyres
jsquy...@cisco.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel