To make this significantly easier, I called Paul and we discussed this at length.

In short -- we ended up agreeing with you.  :-)

As a personal sidenote -- it sucks that we all had to do this much research to figure this out. In particular, we missed the fact that all the kernel versions take 3 arguments (we thought that some took 2), and that's where some of the reasons for the initial approach came from.

So we'll implement this with syscall() and use the getaffinity syscall to probe for the correct length (some kernels require <= sizeof(long), some require == sizeof(long), and some are ok with >= sizeof(long)). Using syscall() cuts out the potentially-buggy middleman (glibc) and removes a layer of indirection. What that layer does can *usually* be deduced, but there's little reason not to call syscall() directly.
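
As a rough sketch of what such a probe could look like (not the actual PLPA code; the function name probe_affinity_len and the PROBE_MAX_LEN bound are made up for illustration), one can call the raw getaffinity syscall with growing lengths until the kernel stops returning EINVAL:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical upper bound (in bytes) on the mask sizes we try. */
#define PROBE_MAX_LEN 4096

/*
 * Probe for a mask length that the running kernel accepts by calling the
 * raw getaffinity syscall directly (no glibc wrapper involved).
 * Returns the accepted length in bytes, or 0 if none was found.
 */
static size_t probe_affinity_len(void)
{
    /* An array of longs, so the buffer is suitably aligned. */
    unsigned long buf[PROBE_MAX_LEN / sizeof(unsigned long)];
    size_t len;

    assert(sizeof(buf) == PROBE_MAX_LEN);

    /* Start at sizeof(long) and double until the kernel accepts the length. */
    for (len = sizeof(long); len <= sizeof(buf); len *= 2) {
        long rc = syscall(SYS_sched_getaffinity, 0, len, buf);
        if (rc >= 0)
            return len;   /* kernel accepted this length */
        if (errno != EINVAL)
            return 0;     /* unrelated failure, e.g. ENOSYS */
    }
    return 0;             /* nothing up to PROBE_MAX_LEN worked */
}
```

Starting at sizeof(long) means a kernel that insists on exactly sizeof(long) is satisfied on the first try, while a kernel that wants a full cpumask_t is found a few doublings later. A return value of 0 would simply mean "no usable affinity support here".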

There are some older systems out there that do not have syscall(), but I don't think we care about them (i.e., we can check for that in configure). Plus, those systems won't have processor affinity, anyway.

Behind the scenes, Paul and I have been working on a standalone library, called Portable Linux Processor Affinity (PLPA), to handle all this junk. The SVN repository is hosted on svn.open-mpi.org -- we'll open it up in a few days (i.e., after we adjust to the syscall() interface). This library will be released under the BSD license; it a) is really pretty small, and b) most importantly, lets other developers using Linux processor affinity stop worrying about any of these horrid details. The PLPA will have its own web page and mailing list, too.

Thanks for your diligence in pestering us about this!  :-)


On Dec 12, 2005, at 10:32 AM, Bogdan Costescu wrote:

On Fri, 9 Dec 2005, Paul H. Hargrove wrote:

If one looks through enough kernel versions,

In the meantime, I've gotten a copy of kernel/sched.c from a SGI Prism
kernel - I assume that it is the same used on Altix; this one has in
the Makefile EXTRAVERSION = -sgi306rp31. So again, all prototypes of
the sys_sched_setaffinity function that I've seen so far have 3
args... which means that no compiler tricks are needed to keep 3
different copies of the function.

one finds that some of them differ in what they will accept for the
len.

OK, so this is a different problem...

Some produce EINVAL if len!=sizeof(long),

I beg to disagree. All the code that I looked at tests for

len < sizeof(new_mask)

and copy user data based on the size of new_mask, so if "len" is
larger than sizeof(new_mask), no error occurs.

others (especially Altix) produce EINVAL if len is too short to
cover all the machine's CPUs.

...so IMHO this test should be used instead to separate a long from a
(larger) cpumask_t.

In the message that described your implementation you also wrote:

while on other kernels I find that a too-short mask is padded w/
zeros and no error results. So, we want a big value for len

Indeed some (more recent) kernels pad with zeros if "len" is too
short. But a "big value for len" is again wrong.

I can see 3 cases, again by looking at the kernel code and not dealing
with 2 vs. 3 args:

1. tests for len < sizeof(long) and copies only sizeof(long) if larger
(backported 2.4 in RHEL3); this can be identified by passing "len"
smaller than sizeof(long) which returns -EINVAL and then passing "len"
of (or larger than) sizeof(long) which should not return error.

2. tests for len < sizeof(cpumask_t) and copies only sizeof(cpumask_t)
if larger (backported 2.4 from SGI, 2.6.3 from Mandrake 10.0); this can
be identified by passing "len" shorter than sizeof(cpumask_t) which
returns -EINVAL and then passing "len" of (or larger than)
sizeof(cpumask_t) which should not return error.

3. tests for len < sizeof(cpumask_t) and pads with zeros if true,
otherwise copies only sizeof(cpumask_t) (2.6.9 in RHEL4 and 2.6.14).
This can't really be identified as it doesn't return -EINVAL in any
situation.

As you can see your suggestion to set "big value for len" would
successfully pass _all_ of the above conditions and would therefore
not offer any separation between the cases.
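
To make that point concrete, here is a small model of the three length checks. It is only a sketch of the checks as described above, not kernel code; the enum and function names are invented for illustration, and long_size and mask_size stand in for sizeof(long) and sizeof(cpumask_t) on the kernel being modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* The three _set behaviours listed above (names are hypothetical). */
enum set_case {
    SET_LONG_EINVAL,    /* case 1: -EINVAL if len < sizeof(long) */
    SET_CPUMASK_EINVAL, /* case 2: -EINVAL if len < sizeof(cpumask_t) */
    SET_CPUMASK_PAD     /* case 3: short mask is zero-padded, never -EINVAL */
};

/*
 * Model of the length check only: returns 0 if "len" would be accepted
 * and -EINVAL otherwise. No data is copied and no syscall is made.
 */
static int model_set_len_check(enum set_case k, size_t len,
                               size_t long_size, size_t mask_size)
{
    assert(long_size > 0 && mask_size >= long_size);

    switch (k) {
    case SET_LONG_EINVAL:
        return len < long_size ? -EINVAL : 0;
    case SET_CPUMASK_EINVAL:
        return len < mask_size ? -EINVAL : 0;
    case SET_CPUMASK_PAD:
        return 0;
    }
    return -EINVAL; /* not reached for valid enum values */
}
```

With, say, long_size = 8 and mask_size = 64, a len of 1024 gets 0 from all three cases, so it tells them apart no better than not testing at all. Only a len below mask_size produces different answers.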

The stuff above applies to the _set function; the _get function is a
bit different:

1. tests for len < sizeof(long) and returns -EINVAL if true.
(backported 2.4 in RHEL3). This can be identified by passing "len"
smaller than sizeof(long) which returns -EINVAL and then passing "len"
of (or larger than) sizeof(long) which should not return error.

2. tests for len < sizeof(cpumask_t) and returns -EINVAL if true.
(backported 2.4 from SGI, 2.6.3 from Mandrake 10.0, 2.6.9 from RHEL4,
2.6.14). This can be identified by passing "len" smaller than
sizeof(cpumask_t) which returns -EINVAL and then passing "len" of (or
larger than) sizeof(cpumask_t) which should not return error.

Case 1. of _set is associated with case 1. of _get.
Cases 2. and 3. of _set are both associated with case 2. of _get.

So IMHO the test should be made with the _get function (as explained
in a previous message), by setting len=sizeof(long) which would allow
the case 1. to work fine, while case 2. would return -EINVAL, exactly
opposite from the code that you proposed.
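
A minimal sketch of that test, assuming a Linux system where SYS_sched_getaffinity is defined in <sys/syscall.h>. The enum and function names are hypothetical, and note the caveat that a case 2. kernel whose cpumask_t happens to be no larger than a long would look like case 1. here:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Outcome of calling _get with len = sizeof(long) (names are hypothetical). */
enum get_case {
    GET_LONG_MASK, /* case 1: a long-sized mask is accepted */
    GET_CPUMASK,   /* case 2: -EINVAL, kernel wants a larger cpumask_t */
    GET_UNKNOWN    /* some other failure, e.g. ENOSYS */
};

/* Map the raw syscall result (return value plus errno) to a case. */
static enum get_case classify_get_result(long rc, int err)
{
    assert(rc >= 0 || err != 0); /* a failed call must carry an errno */

    if (rc >= 0)
        return GET_LONG_MASK;
    if (err == EINVAL)
        return GET_CPUMASK;
    return GET_UNKNOWN;
}

/* Call the raw getaffinity syscall with len = sizeof(long) and classify. */
static enum get_case probe_get_case(void)
{
    unsigned long mask = 0;
    long rc = syscall(SYS_sched_getaffinity, 0, sizeof(mask), &mask);
    int err = (rc < 0) ? errno : 0;

    return classify_get_result(rc, err);
}
```

Keeping the classification separate from the syscall makes the decision easy to check without having each kernel flavour at hand.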

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/


