This is one of the few topics that we didn't get to discuss last week.
I think there are two main parts -- an easy part and a hard part. :-)
Easy part: the processor affinity framework and its interface
Hard part: how and when this framework is invoked in Open MPI
Processor affinity framework ([potentially] easy part)
======================================================
Framework name: paffinity
Intent:
-------
This is an extremely simple framework that is used to support the
OS-specific API for placement of processes on processors. It does
*not* decide scheduling issues -- it is simply for assigning the
current process to a specific processor. As such, the components
are likely to be extremely short/simple -- there will likely be one
component for each OS/API that we support (e.g., Linux, IRIX, etc.).
As a direct consequence, there will likely only be one component that
is usable on a given platform (making selection easy).
Scheduling issues are discussed below (that's part of the hard part).
Base interface:
---------------
paffinity_base_open()
paffinity_base_select()
paffinity_base_close()
--> same as most other framework open, select, and close calls.
select() will use simple priority if multiple components report that
they are selectable (an unlikely situation, but still...).
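To make that concrete, here is a rough sketch of what priority-based
selection could look like. The struct layout, field names, and the two
dummy components are invented for illustration -- this is not the
actual MCA base code:

```c
#include <stddef.h>

/* Hypothetical component descriptor: query() returns a priority,
 * or -1 if the component cannot run on this platform. */
typedef struct {
    const char *name;
    int (*query)(void);
} paffinity_component_t;

/* Pick the selectable component with the highest priority;
 * return NULL if no component is usable on this platform. */
static const paffinity_component_t *
paffinity_base_select(const paffinity_component_t **comps, size_t n)
{
    const paffinity_component_t *best = NULL;
    int best_pri = -1;

    for (size_t i = 0; i < n; ++i) {
        int pri = comps[i]->query();
        if (pri > best_pri) {
            best_pri = pri;
            best = comps[i];
        }
    }
    return best;
}

/* Two dummy components: one usable here, one not. */
static int linux_query(void) { return 10; }  /* selectable, priority 10 */
static int irix_query(void)  { return -1; }  /* not selectable here */

static const paffinity_component_t linux_comp = { "linux", linux_query };
static const paffinity_component_t irix_comp  = { "irix",  irix_query  };
static const paffinity_component_t *all_comps[] = { &linux_comp, &irix_comp };
```

Since at most one component should normally be selectable, the
priority comparison is really just a tie-breaker for the unlikely case.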
Component interface:
--------------------
component.open()
component.query()
component.close()
--> same as most other component open, query, close calls
component.init()
--> initialization after selection
component.get_num_processors()
--> returns the number of processors on the machine that we can place
processes on (e.g., a 4-way SMP returns 4; for multi-core and/or
hyperthreaded processors, it'll be whatever we can component.set() a
process on). More specifically, this function returns the value N as
described in the discussion of component.set(), below.
component.set(id)
--> set which [virtual] CPU ID to use (0 to N-1). This may need to
remap the virtual CPU ID to a real back-end CPU ID (however the
back-end API works -- the front-end component interface presents CPUs
with IDs 0 through N-1 to the rest of the OPAL/ORTE/OMPI code base).
component.get()
--> returns which [virtual, meaning 0-(N-1)] CPU ID this process is
running on
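For concreteness, here is roughly what the guts of a Linux component
might look like, built on the glibc sched_getaffinity() /
sched_setaffinity() / sched_getcpu() calls. The function names are
made up for illustration, and a real component would also do the
virtual-to-physical ID remapping described above rather than passing
the virtual ID straight through:

```c
#define _GNU_SOURCE
#include <sched.h>

/* get_num_processors(): how many CPUs this process may be placed on. */
int linux_paffinity_get_num_processors(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

/* set(id): bind the calling process to [virtual] CPU id.  A real
 * component would first map the virtual ID to a back-end CPU ID. */
int linux_paffinity_set(int id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(id, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* get(): which CPU the calling process is currently running on. */
int linux_paffinity_get(void)
{
    return sched_getcpu();
}
```

Other OSes (IRIX, Solaris, etc.) would get their own equally small
components wrapping their native APIs, which is why each component
should stay short and simple.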
Location:
---------
An argument can be made to put this framework in any of opal, orte, or
ompi:
- opal: it's OS/platform-dependent stuff, and that's exactly what opal
is for
- orte: it's run-time stuff, and therefore should be in ORTE
- ompi: depending on the discussion below, this may only be used for
MPI processes, and therefore belongs in the ompi tree
All things being equal, I guess I would prefer it to be in opal, but
could be swayed on this.
How and when this framework is invoked
======================================
This is the hard one -- given the simple framework above, we can
easily assign any process to any processor, but how do we decide:
a) which jobs to use processor affinity on?
b) how to assign which process to which processor?
Let's tackle these in order:
a) which jobs do we use processor affinity on?
I'm of the mind that only MPI jobs will care about this (and therefore
we should invoke this early in MPI_INIT), but then again, ORTE may
start heavily using non-MPI jobs (and therefore we may want to invoke
this early in orte_init()). I don't really care, either way -- if we
go with what we know today, then we should do this up at the top of
MPI_INIT (ompi_mpi_init(), that is), probably right after orte_init().
b) how do we assign a process to a given processor?
There are multiple complications here, such as the fact that we are
*not* the OS scheduler, nor the HPC batch scheduler. So it may be
possible that processes not from our ORTE job may be scheduled on the
same node as us, and we may not know it. Hence, if we just always
start assigning [virtual] CPU IDs from 0, we could run into scenarios
such as the following:
- cluster of 2-way SMPs
- cluster is run by a batch scheduler
- scheduler assigns one process from an OMPI job to node A
- scheduler assigns one process from another OMPI job to node A
In such a case, both OMPI jobs would assign themselves to CPU ID 0,
which would be a Very Bad Thing. Hence, I think we can really only use
processor affinity in two cases:
- when the scheduler system (bproc, slurm, pbs, whatever) tells us
what CPU IDs we can use
- when the user tells us what CPU IDs we can use
The first case is obvious -- we'll snarf in the information from the
scheduler and use it to invoke the framework to assign us to a CPU
(easy enough to add logic such that if multiple processes in a single
job end up on the same node and the scheduler provides info about CPU
IDs for each, we'll do the Right Things).
This scheme *should* take care of the corner case where we would not
want to use processor affinity if a node is oversubscribed -- a
scheduler should not give us CPU ID info in that case. But it's easy
enough to add detection logic and avoid the use of paffinity in such a
case.
But I think that this raises the possibility of a paffinity scheduling
framework. Some code somewhere needs to determine the scheduling of
processes to processors (e.g., reading information from a scheduler and
then mapping that to calls to the paffinity framework) -- this code may
well be scheduler-dependent (e.g., read some SLURM environment
variables in the target job). Hence, it may be worthwhile to have a
separate framework that has a[nother] simple component interface,
something like
component.what_processor_should_this_process_be_bound_to()...? This
function could return either -1 (this process should not be bound to a
processor) or some value >= 0 indicating the virtual CPU ID that this
process should be bound to.
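A minimal sketch of that scheduling-component query might look like
the following. Note that the environment variable name here is purely
hypothetical (as discussed below, I don't know of a scheduler that
actually exports specific CPU IDs); a real component would read
whatever its scheduler actually sets:

```c
#include <stdlib.h>

/* Parse a CPU ID string from a scheduler: returns the non-negative
 * virtual CPU ID, or -1 if the value is absent or bogus. */
int parse_cpu_id(const char *val)
{
    if (NULL == val) {
        return -1;               /* no value: do not bind */
    }
    char *end = NULL;
    long id = strtol(val, &end, 10);
    if (end == val || id < 0) {
        return -1;               /* unparseable or negative: do not bind */
    }
    return (int) id;
}

/* Hypothetical scheduling-component query: -1 means "this process
 * should not be bound", >= 0 is the virtual CPU ID to bind to.
 * HYPOTHETICAL_SCHED_CPU_ID is an invented variable name. */
int what_processor_should_this_process_be_bound_to(void)
{
    return parse_cpu_id(getenv("HYPOTHETICAL_SCHED_CPU_ID"));
}
```

The caller (e.g., in ompi_mpi_init()) would then only invoke
paffinity's component.set() when the returned value is >= 0.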
--> Related question: do we know of any scheduler that provides
specific CPU ID information? AFAIK, BProc, SLURM, and PBS/Torque do
not. :-\
For the second case, we can provide one or more MCA params to specify
this kind of behavior (e.g., the common case will be to assume that we
own the entire node and just start scheduling -- if not oversubscribed
-- from virtual CPU ID 0).
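That "we own the whole node" policy can be sketched in a few lines --
the i-th local process gets virtual CPU ID i, and we decline to bind
at all if the node is oversubscribed. The function and parameter
names are illustrative, not actual MCA parameter names:

```c
/* Sketch of the "assume we own the node" policy: give the i-th local
 * process virtual CPU ID i, starting from 0, but return -1 (do not
 * use paffinity) if there are more local processes than processors. */
int assign_virtual_cpu(int local_rank, int num_local_procs,
                       int num_processors)
{
    if (num_local_procs > num_processors) {
        return -1;    /* oversubscribed: skip processor affinity */
    }
    return local_rank;
}
```

This also doubles as the oversubscription-detection logic mentioned
earlier: the same check that starts scheduling from virtual CPU ID 0
naturally declines to bind when the node is oversubscribed.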
-----
Comments?
I would like to implement this within the next week or two, so timely
comments would be appreciated (I am happy to discuss at higher
bandwidth, such as by telephone, if it would help). :-)
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/