This is one of the few topics that we didn't get to discuss last week. I think there are two main parts -- an easy part and a hard part. :-)

Easy part: the processor affinity framework and its interface
Hard part: how and when this framework is invoked in Open MPI

Processor affinity framework ([potentially] easy part)
======================================================

Framework name: paffinity

Intent:
-------

This is an extremely simple framework that is used to support the OS-specific API for placement of processes on processors. It does *not* decide scheduling issues -- it is simply for assigning the current process to a specific processor. As such, the components are likely to be extremely short/simple -- there will likely be one component for each OS/API that we support (e.g., Linux, IRIX, etc.). As a direct consequence, there will likely only be one component that is usable on a given platform (making selection easy).

Scheduling issues are discussed below (that's part of the hard part).

Base interface:
---------------

paffinity_base_open()
paffinity_base_select()
paffinity_base_close()
--> same as most other framework open, select, and close calls. select() will use simple priority if multiple components report that they are selectable (an unlikely situation, but still...).
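Just to make the selection logic concrete, here's roughly the kind of thing I have in mind -- the struct and function names below are invented for this discussion, not actual MCA types:

/* Illustrative sketch of priority-based selection; the type and
 * function names are made up, not real MCA APIs. */
#include <stddef.h>

struct paffinity_component {
    const char *name;
    /* query() returns 0 and sets *priority if the component can run on
     * this platform; nonzero if it cannot. */
    int (*query)(int *priority);
};

/* Pick the usable component with the highest priority (usually there
 * will be exactly one, since each component wraps one OS-specific API). */
const struct paffinity_component *
paffinity_base_select(const struct paffinity_component **comps, size_t n)
{
    const struct paffinity_component *best = NULL;
    int best_pri = -1;

    for (size_t i = 0; i < n; ++i) {
        int pri;
        if (0 == comps[i]->query(&pri) && pri > best_pri) {
            best = comps[i];
            best_pri = pri;
        }
    }
    return best;   /* NULL means no component is usable on this platform */
}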

Component interface:
--------------------

component.open()
component.query()
component.close()
--> same as most other component open, query, close calls

component.init()
--> initialization after selection

component.get_num_processors()
--> returns the number of processors on the machine that we can place processes on (e.g., a 4-way SMP returns 4; for multi-core and/or hyperthreaded processors, it'll be however many [virtual] CPUs we can component.set() a process on). More specifically, this function returns the value N as described in the discussion of component.set(), below.

component.set(id)
--> set which [virtual] CPU ID to use (0 to N-1). This may need to remap the virtual CPU ID to a real back-end CPU ID (however the back-end API works -- the front-end component interface presents CPUs with IDs 0 through N-1 to the rest of the OPAL/ORTE/OMPI code base).

component.get()
--> returns which [virtual, meaning 0-(N-1)] CPU ID this process is running on
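To make the component interface concrete, here's a rough sketch of what the module struct and a Linux component's set() might look like. The struct layout and names are invented here; sched_setaffinity() is just an example of the kind of back-end OS API a component would wrap, and the identity mapping from virtual to physical CPU ID is an assumption:

/* Illustrative only -- not the actual component struct. */
#define _GNU_SOURCE
#include <sched.h>

struct paffinity_module {
    int (*init)(void);
    int (*get_num_processors)(int *n);
    int (*set)(int virtual_id);    /* bind this process to virtual CPU 0..N-1 */
    int (*get)(int *virtual_id);   /* which virtual CPU are we bound to? */
};

/* What a Linux component's set() might do: map the virtual CPU ID to a
 * back-end (physical) CPU ID and bind the calling process to it.  The
 * identity mapping is assumed here; a real component may need a
 * remapping table. */
int linux_paffinity_set(int virtual_id)
{
    cpu_set_t mask;
    int physical_id = virtual_id;   /* identity mapping for illustration */

    CPU_ZERO(&mask);
    CPU_SET(physical_id, &mask);
    return sched_setaffinity(0 /* this process */, sizeof(mask), &mask);
}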

Location:
---------

An argument can be made to put this framework in any of opal, orte, or ompi:

- opal: it's OS/platform-dependent stuff, and that's exactly what opal is for
- orte: it's run-time stuff, and therefore should be in ORTE
- ompi: depending on the discussion below, this may only be used for MPI processes, and therefore belongs in the ompi tree

All things being equal, I guess I would prefer it to be in opal, but could be swayed on this.

How and when this framework is invoked
======================================

This is the hard one -- given the simple framework above, we can easily assign any process to any processor, but how do we decide:

a) which jobs to use processor affinity on?
b) how to assign which process to which processor?

Let's tackle these in order:

a) which jobs do we use processor affinity on?

I'm of the mind that only MPI jobs will care about this (and therefore we should hide this early in MPI_INIT), but then again, ORTE may start heavily using non-MPI jobs (and therefore we may want to hide this early in orte_init()). I don't really care, either way -- if we go with what we know today, then we should do this up at the top of MPI_INIT (ompi_mpi_init(), that is), probably right after orte_init().
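In code terms, I'm imagining something like the following called from the top of ompi_mpi_init(), right after orte_init() returns successfully (both paffinity_* names below are placeholders -- neither exists yet):

/* Placeholder prototypes -- purely hypothetical names. */
int paffinity_scheduling_decide(void);       /* -1 = do not bind this process */
int paffinity_base_set(int virtual_cpu_id);  /* bind to virtual CPU 0..N-1 */

/* Would be invoked early in ompi_mpi_init(), after orte_init(). */
void bind_self_if_appropriate(void)
{
    int cpu = paffinity_scheduling_decide();
    if (cpu >= 0) {
        (void) paffinity_base_set(cpu);
    }
}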

b) how do we assign a process to a given processor?

There are multiple complications here, such as the fact that we are *not* the OS scheduler, nor the HPC batch scheduler. So processes not from our ORTE job may be scheduled on the same node as us without our knowing it. Hence, if we just always start assigning [virtual] CPU IDs from 0, we could run into scenarios such as the following:

- cluster of 2-way SMPs
- cluster is run by a batch scheduler
- scheduler assigns one process from an OMPI job to node A
- scheduler assigns one process from another OMPI job to node A

In such a case, both OMPI jobs would assign themselves to CPU ID 0, which would be a Very Bad Thing. Hence, I think we can really only use processor affinity in two cases:

- when the scheduler system (bproc, slurm, pbs, whatever) tells us what CPU ID's we can use
- when the user tells us what CPU ID's we can use

The first case is obvious -- we'll snarf in the information from the scheduler and use it to invoke the framework to assign us to a CPU (easy enough to add logic such that if multiple processes in a single job end up on the same node and the scheduler provides info about CPU ID's for each, we'll do the Right Things).

This scheme *should* take care of the corner case where we would not want to use processor affinity if a node is oversubscribed -- a scheduler should not give us CPU ID info in that case. But it's easy enough to add detection logic and avoid the use of paffinity in such a case.

But I think that this raises the possibility of a paffinity scheduling framework. Some code somewhere needs to determine the scheduling of processes to processors (e.g., reading information from a scheduler and then mapping that to calls to the paffinity framework) -- this code may well be scheduler-dependent (e.g., read some SLURM environment variables in the target job). Hence, it may be worthwhile to have a separate framework that has a[nother] simple component interface, something like component.what_processor_should_this_process_be_bound_to()...? This function could return either -1 (this process should not be bound to a processor) or some value >= 0 indicating the virtual CPU ID that this process should be bound to.
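The component interface for such a scheduling framework could be as small as one call; a hypothetical sketch (again, names invented for the sake of discussion):

/* Hypothetical "paffinity scheduling" component interface -- nothing
 * like this exists yet. */
struct paffinity_sched_module {
    /* Returns -1 if this process should not be bound, or the virtual
     * CPU ID (0..N-1) it should be bound to.  A SLURM component would
     * derive the answer from SLURM environment variables (if/when SLURM
     * exposes per-process CPU IDs); a "user" component would read MCA
     * parameters. */
    int (*what_processor_should_this_process_be_bound_to)(void);
};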

--> Related question: do we know of any scheduler that provides specific CPU ID information? AFAIK, BProc, SLURM, and PBS/Torque do not. :-\

For the second case, we can provide one or more MCA params to specify this kind of behavior (e.g., the common case will be to assume that we own the entire node and just start scheduling -- if not oversubscribed -- from virtual CPU ID 0).
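For example, the "user" case might boil down to a single boolean MCA parameter meaning "we own the whole node; bind local rank i to virtual CPU i unless oversubscribed." The parameter name and env-var spelling below are made up for illustration:

#include <stdlib.h>

/* Hypothetical handling of a user-supplied "we own the node" flag.
 * Returns the virtual CPU ID to bind to, or -1 for "do not bind". */
int choose_cpu_for_local_rank(int local_rank, int num_cpus)
{
    /* e.g., set via "mpirun --mca paffinity_alone 1 ..." -- name invented */
    const char *alone = getenv("OMPI_MCA_paffinity_alone");

    if (NULL == alone || '1' != alone[0]) {
        return -1;               /* user did not ask for binding */
    }
    if (local_rank >= num_cpus) {
        return -1;               /* oversubscribed -- do not bind */
    }
    return local_rank;           /* virtual CPU ID 0..N-1 */
}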

-----

Comments?

I would like to implement this within the next week or two, so timely comments would be appreciated (am happy to discuss via something higher bandwidth, such as a phone call, if it would help). :-)

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
