This is one of the few topics that we didn't get to discuss last week.
I think there are two main parts -- an easy part and a hard part. :-)
Easy part: the processor affinity framework and its interface
Hard part: how and when this framework is invoked in Open MPI
Processor affinity framework ([potentially] easy part)
======================================================
Framework name: paffinity
Intent:
-------
This is an extremely simple framework that is used to support the
OS-specific API for placement of processes on processors. It does
*not* decide scheduling issues -- it is simply for assigning the
current process to a specific processor. As such, the components
are likely to be extremely short/simple -- there will likely be one
component for each OS/API that we support (e.g., Linux, IRIX, etc.).
As a direct consequence, there will likely only be one component that
is usable on a given platform (making selection easy).
Scheduling issues are discussed below (that's part of the hard part).
Base interface:
---------------
paffinity_base_open()
paffinity_base_select()
paffinity_base_close()
--> same as most other framework open, select, and close calls.
select() will use simple priority if multiple components report that
they are selectable (an unlikely situation, but still...).
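To make that concrete, here is a rough sketch of what priority-based
selection could look like. The struct layout, field names, and the two
dummy components are invented for illustration -- this is not the
actual MCA base code:

```c
#include <stddef.h>

/* Hypothetical component descriptor: query() returns a priority,
 * or -1 if the component cannot run on this platform. */
typedef struct {
    const char *name;
    int (*query)(void);
} paffinity_component_t;

/* Pick the selectable component with the highest priority;
 * return NULL if no component is usable on this platform. */
static const paffinity_component_t *
paffinity_base_select(const paffinity_component_t **comps, size_t n)
{
    const paffinity_component_t *best = NULL;
    int best_pri = -1;

    for (size_t i = 0; i < n; ++i) {
        int pri = comps[i]->query();
        if (pri > best_pri) {
            best_pri = pri;
            best = comps[i];
        }
    }
    return best;
}

/* Two dummy components: one usable here, one not. */
static int linux_query(void) { return 10; }  /* selectable, priority 10 */
static int irix_query(void)  { return -1; }  /* not selectable here */

static const paffinity_component_t linux_comp = { "linux", linux_query };
static const paffinity_component_t irix_comp  = { "irix",  irix_query  };
static const paffinity_component_t *all_comps[] = { &linux_comp, &irix_comp };
```

Since at most one component should normally be selectable, the
priority comparison is really just a tie-breaker for the unlikely case.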
Component interface:
--------------------
component.open()
component.query()
component.close()
--> same as most other component open, query, close calls
component.init()
--> initialization after selection
component.get_num_processors()
--> returns the number of processors on the machine that we can place
processes on (e.g., a 4-way SMP returns 4; for multi-core and/or
hyperthreaded processors, it'll be whatever we can component.set() a
process on). More specifically, this function returns the value N as
described in the discussion of component.set(), below.
component.set(id)
--> set which [virtual] CPU ID to use (0 to N-1). This may need to
remap the virtual CPU ID to a real back-end CPU ID (however the
back-end API works -- the front-end component interface presents CPUs
with IDs 0 through N-1 to the rest of the OPAL/ORTE/OMPI code base).
component.get()
--> returns which [virtual, meaning 0-(N-1)] CPU ID this process is
running on
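For concreteness, here is roughly what the guts of a Linux component
might look like, built on the glibc sched_getaffinity() /
sched_setaffinity() / sched_getcpu() calls. The function names are
made up for illustration, and a real component would also do the
virtual-to-physical ID remapping described above rather than passing
the virtual ID straight through:

```c
#define _GNU_SOURCE
#include <sched.h>

/* get_num_processors(): how many CPUs this process may be placed on. */
int linux_paffinity_get_num_processors(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

/* set(id): bind the calling process to [virtual] CPU id.  A real
 * component would first map the virtual ID to a back-end CPU ID. */
int linux_paffinity_set(int id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(id, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* get(): which CPU the calling process is currently running on. */
int linux_paffinity_get(void)
{
    return sched_getcpu();
}
```

Other OSes (IRIX, Solaris, etc.) would get their own equally small
components wrapping their native APIs, which is why each component
should stay short and simple.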
Location:
---------
An argument can be made to put this framework in any of opal, orte, or
ompi:
- opal: it's OS/platform-dependent stuff, and that's exactly what opal
is for
- orte: it's run-time stuff, and therefore should be in ORTE
- ompi: depending on the discussion below, this may only be used for
MPI processes, and therefore belongs in the ompi tree
All things being equal, I guess I would prefer it to be in opal, but
could be swayed on this.
How and when this framework is invoked
======================================
This is the hard one -- given the simple framework above, we can
easily assign any process to any processor, but how do we decide:
a) which jobs to use processor affinity on?
b) how to assign which process to which processor?
Let's tackle these in order:
a) which jobs do we use processor affinity on?
I'm of the mind that only MPI jobs will care about this (and therefore
we should invoke this early in MPI_INIT), but then again, ORTE may
start heavily using non-MPI jobs (and therefore we may want to invoke
this early in orte_init()). I don't really care, either way -- if we
go with what we know today, then we should do this up at the top of
MPI_INIT (ompi_mpi_init(), that is), probably right after orte_init().
b) how do we assign a process to a given processor?
There are multiple complications here, such as the fact that we are
*not* the OS scheduler, nor the HPC batch scheduler. So it may be
possible that processes not from our ORTE job may be scheduled on the
same node as us, and we may not know it. Hence, if we just always
start assigning [virtual] CPU IDs from 0, we could run into scenarios
such as the following:
- cluster of 2-way SMPs
- cluster is run by a batch scheduler
- scheduler assigns one process from an OMPI job to node A
- scheduler assigns one process from another OMPI job to node A
In such a case, both OMPI jobs would assign themselves to CPU ID 0,
which would be a Very Bad Thing. Hence, I think we can really only use
processor affinity in two cases:
- when the scheduler system (bproc, slurm, pbs, whatever) tells us
what CPU IDs we can use
- when the user tells us what CPU IDs we can use
The first case is obvious -- we'll snarf in the information from the
scheduler and use it to invoke the framework to assign us to a CPU
(easy enough to add logic such that if multiple processes in a single
job end up on the same node and the scheduler provides info about CPU
IDs for each, we'll do the Right Things).
This scheme *should* take care of the corner case where we would not
want to use processor affinity if a node is oversubscribed -- a
scheduler should not give us CPU ID info in that case. But it's easy
enough to add detection logic and avoid the use of paffinity in such a
case.
But I think that this raises the possibility of a paffinity scheduling
framework. Some code somewhere needs to determine the scheduling of
processes to processors (e.g., reading information from a scheduler and
then mapping that to calls to the paffinity framework) -- this code may
well be scheduler-dependent (e.g., read some SLURM environment
variables in the target job). Hence, it may be worthwhile to have a
separate framework that has a[nother] simple component interface,
something like
component.what_processor_should_this_process_be_bound_to()...? This
function could return either -1 (this process should not be bound to a
processor) or some value >= 0 indicating the virtual CPU ID that this
process should be bound to.
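A minimal sketch of that scheduling-component query might look like
the following. Note that the environment variable name here is purely
hypothetical (as discussed below, I don't know of a scheduler that
actually exports specific CPU IDs); a real component would read
whatever its scheduler actually sets:

```c
#include <stdlib.h>

/* Parse a CPU ID string from a scheduler: returns the non-negative
 * virtual CPU ID, or -1 if the value is absent or bogus. */
int parse_cpu_id(const char *val)
{
    if (NULL == val) {
        return -1;               /* no value: do not bind */
    }
    char *end = NULL;
    long id = strtol(val, &end, 10);
    if (end == val || id < 0) {
        return -1;               /* unparseable or negative: do not bind */
    }
    return (int) id;
}

/* Hypothetical scheduling-component query: -1 means "this process
 * should not be bound", >= 0 is the virtual CPU ID to bind to.
 * HYPOTHETICAL_SCHED_CPU_ID is an invented variable name. */
int what_processor_should_this_process_be_bound_to(void)
{
    return parse_cpu_id(getenv("HYPOTHETICAL_SCHED_CPU_ID"));
}
```

The caller (e.g., in ompi_mpi_init()) would then only invoke
paffinity's component.set() when the returned value is >= 0.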
--> Related question: do we know of any scheduler that provides
specific CPU ID information? AFAIK, BProc, SLURM, and PBS/Torque do
not. :-\
For the second case, we can provide one or more MCA params to specify
this kind of behavior (e.g., the common case will be to assume that we
own the entire node and just start scheduling -- if not oversubscribed
-- from virtual CPU ID 0).
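That "we own the whole node" policy can be sketched in a few lines --
the i-th local process gets virtual CPU ID i, and we decline to bind
at all if the node is oversubscribed. The function and parameter
names are illustrative, not actual MCA parameter names:

```c
/* Sketch of the "assume we own the node" policy: give the i-th local
 * process virtual CPU ID i, starting from 0, but return -1 (do not
 * use paffinity) if there are more local processes than processors. */
int assign_virtual_cpu(int local_rank, int num_local_procs,
                       int num_processors)
{
    if (num_local_procs > num_processors) {
        return -1;    /* oversubscribed: skip processor affinity */
    }
    return local_rank;
}
```

This also doubles as the oversubscription-detection logic mentioned
earlier: the same check that starts scheduling from virtual CPU ID 0
naturally declines to bind when the node is oversubscribed.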
-----
Comments?
I would like to implement this within the next week or two, so timely
comments would be appreciated (I am happy to discuss at higher
bandwidth, such as by telephone, if it would help). :-)
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/