Re: [O-MPI devel] Linux processor affinity

Jeff Squyres Mon, 12 Dec 2005 16:40:55 -0500

To make this significantly easier, I called Paul and we discussed this at length.


In short -- we ended up agreeing with you.  :-)

As a personal side note -- it sucks that we all had to do this much research to figure this out. In particular, we missed the fact that all the kernel versions take 3 arguments (we thought that some took 2), and that's where some of the reasons for the initial approach came from.

So we'll implement this with syscall() and use the getaffinity syscall to probe for the correct length (some kernels require <= sizeof(long), some require == sizeof(long), and some are ok with >= sizeof(long)). Using syscall() cuts out the potentially-buggy middleman (glibc) and removes a layer of indirection whose behavior can *usually* be deduced; there's little reason not to use the syscall directly.
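A minimal sketch of the probing idea, assuming a Linux system whose `<sys/syscall.h>` defines `SYS_sched_getaffinity`. The helper name `probe_affinity_len`, the starting length of `sizeof(unsigned long)`, and the doubling strategy are illustrative assumptions, not PLPA's actual implementation; doubling only lands on the kernel's mask size if that size is a power-of-two multiple of `sizeof(unsigned long)`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PLPA code): find a mask length, in bytes, that
 * the running kernel's sched_getaffinity accepts. Start at
 * sizeof(unsigned long) and double on EINVAL. Returns the accepted length,
 * or 0 if nothing up to max_len works or the syscall fails for another
 * reason (e.g. ENOSYS).
 */
static size_t probe_affinity_len(size_t max_len)
{
    for (size_t len = sizeof(unsigned long); len <= max_len; len *= 2) {
        void *mask = calloc(1, len);
        if (mask == NULL)
            return 0;

        /* Raw syscall, bypassing the glibc wrapper; pid 0 = calling thread. */
        long rc = syscall(SYS_sched_getaffinity, 0, len, mask);
        int err = errno;   /* save before free(), which may clobber errno */
        free(mask);

        if (rc >= 0)
            return len;    /* the kernel accepted this length */
        if (err != EINVAL)
            return 0;      /* a different failure: give up */
    }
    return 0;
}
```

On an LP64 system, `probe_affinity_len(4096)` tries 8, 16, 32, ... bytes and returns the first length the kernel accepts. Treating any non-negative return as success keeps the sketch independent of exactly what value a given kernel reports on success.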

There are some older systems out there that do not have syscall(), but I don't think we care about them (and we can check for that in configure). Plus, those systems won't have processor affinity anyway.
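configure would normally test for syscall() itself; at the C level, a complementary guard might look like the sketch below. The function name is hypothetical, and the sketch assumes that a missing `SYS_sched_getaffinity` macro at build time, or an `ENOSYS` failure at run time, both mean "no processor affinity here".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if sched_getaffinity can be reached through
 * syscall() on this system, 0 otherwise. The #ifdef covers headers that lack
 * the syscall number; ENOSYS at run time covers kernels that lack the call.
 */
static int affinity_syscall_available(void)
{
#ifdef SYS_sched_getaffinity
    unsigned long mask[128] = {0};   /* generously sized buffer: 128 longs */
    long rc = syscall(SYS_sched_getaffinity, 0, sizeof(mask), mask);
    if (rc < 0 && errno == ENOSYS)
        return 0;                    /* number known, but kernel lacks the call */
    return 1;
#else
    return 0;                        /* headers know nothing about the syscall */
#endif
}
```

Any other errno (for example EINVAL from an unexpectedly small buffer) still counts as "available", since the syscall itself exists.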

Behind the scenes, Paul and I have been working on a standalone library, called Portable Linux Processor Affinity (PLPA), to handle all this junk. The SVN repository is hosted on svn.open-mpi.org -- we'll open it up in a few days (i.e., after we adjust to the syscall() interface). This library will be released under the BSD license and a) is really pretty small, but b) most importantly, allows other developers using Linux processor affinity to not worry about any of these horrid details. The PLPA will have its own web page and mailing list, too.


Thanks for your diligence in pestering us about this!  :-)


On Dec 12, 2005, at 10:32 AM, Bogdan Costescu wrote:

On Fri, 9 Dec 2005, Paul H. Hargrove wrote:

If one looks through enough kernel versions,


In the meantime, I've gotten a copy of kernel/sched.c from an SGI Prism
kernel (I assume it is the same one used on Altix); its Makefile has
EXTRAVERSION = -sgi306rp31. So again, all prototypes of the
sys_sched_setaffinity function that I've seen so far have 3 args...
which means that no compiler tricks are needed to keep 3 different
copies of the function.

one finds that some of them differ in what they will accept for the
len.


OK, so this is a different problem...

Some produce EINVAL if len!=sizeof(long),


I beg to disagree. All the code that I looked at tests for

len < sizeof(new_mask)

and copy user data based on the size of new_mask, so if "len" is
larger than sizeof(new_mask), no error occurs.

others (especially Altix) produce EINVAL if len is too short to
cover all the machine's CPUs.


...so IMHO this test should be used instead to separate a long from a
(larger) cpumask_t.

In the message that described your implementation you also wrote:

while on other kernels I find that a too-short mask is padded w/
zeros and no error results. So, we want a big value for len


Indeed some (more recent) kernels pad with zeros if "len" is too
short. But a "big value for len" is again wrong.

I can see 3 cases, again by looking at the kernel code and not dealing
with 2 vs. 3 args:

1. tests for len < sizeof(long) and copies only sizeof(long) bytes if
len is larger (backported 2.4 in RHEL3); this can be identified by
passing "len" smaller than sizeof(long), which returns -EINVAL, and then
passing "len" of (or larger than) sizeof(long), which should not return
an error.

2. tests for len < sizeof(cpumask_t) and copies only sizeof(cpumask_t)
bytes if len is larger (backported 2.4 from SGI, 2.6.3 from Mandrake
10.0); this can be identified by passing "len" shorter than
sizeof(cpumask_t), which returns -EINVAL, and then passing "len" of (or
larger than) sizeof(cpumask_t), which should not return an error.

3. tests for len < sizeof(cpumask_t) and pads with zeros if true,
otherwise copies only sizeof(cpumask_t) (2.6.9 in RHEL4 and 2.6.14).
This can't really be identified as it doesn't return -EINVAL in any
situation.

As you can see, your suggestion to set a "big value for len" would
successfully pass _all_ of the above conditions and would therefore
not offer any separation between the cases.

The stuff above applies to the _set function; the _get function is a
bit different:

1. tests for len < sizeof(long) and returns -EINVAL if true.
(backported 2.4 in RHEL3). This can be identified by passing "len"
smaller than sizeof(long) which returns -EINVAL and then passing "len"
of (or larger than) sizeof(long) which should not return error.

2. tests for len < sizeof(cpumask_t) and returns -EINVAL if true.
(backported 2.4 from SGI, 2.6.3 from Mandrake 10.0, 2.6.9 from RHEL4,
2.6.14). This can be identified by passing "len" smaller than
sizeof(cpumask_t), which returns -EINVAL, and then passing "len" of (or
larger than) sizeof(cpumask_t), which should not return an error.

Case 1. of _set is associated with case 1. of _get.
Cases 2. and 3. of _set are both associated with case 2. of _get.

So IMHO the test should be made with the _get function (as explained
in a previous message), by setting len=sizeof(long): case 1. would then
succeed, while case 2. would return -EINVAL, exactly the opposite of the
code that you proposed.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
