Re: [O-MPI devel] Linux processor affinity

2005-12-12 Thread Jeff Squyres
To make this significantly easier, I called Paul and we discussed  
this at length.


In short -- we ended up agreeing with you.  :-)

As a personal sidenote -- it sucks that we all had to do this much  
research to figure this out.  In particular, we missed the fact that  
all the kernel versions take 3 arguments (we thought that some took  
2), and that's where some of the reasons for the initial approach  
came from.


So we'll implement this as a syscall() and use the getaffinity  
syscall to probe for the correct length (some kernels require <=  
sizeof(long), some require == sizeof(long), and some are ok with >=  
sizeof(long)).  Using syscall() cuts out the potentially-buggy  
middleman (glibc), and removes a layer of indirection that is  
*usually* able to be deduced, but there's little reason not to use  
syscall directly.
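
A length probe of the kind described above might look roughly like the following. This is only a sketch under stated assumptions, not the PLPA code: the name probe_mask_len and the 8192-byte upper bound are made up for illustration, and it assumes a Linux system where syscall() and SYS_sched_getaffinity are available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() in <unistd.h> */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Largest mask length tried, in bytes (an arbitrary bound for this sketch). */
#define PROBE_MAX_BYTES 8192

/*
 * Ask the kernel directly (no glibc wrapper) for the current mask, starting
 * at sizeof(long) and doubling the length for as long as it answers EINVAL.
 * Returns the first accepted length in bytes, or -1 if none was accepted.
 */
long probe_mask_len(void)
{
    unsigned long buf[PROBE_MAX_BYTES / sizeof(unsigned long)];
    size_t len;

    for (len = sizeof(unsigned long); len <= sizeof(buf); len *= 2) {
        long rc = syscall(SYS_sched_getaffinity, 0L, (unsigned long)len, buf);
        if (rc >= 0) {
            return (long)len;       /* the kernel accepted this length */
        }
        if (errno != EINVAL) {
            return -1;              /* e.g. ENOSYS: no affinity syscall */
        }
    }
    return -1;
}
```

Because only the get call is used, the probe does not change the affinity of the calling process.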


There are some older systems out there that do not have syscall(),  
but I don't think we care about them (i.e., we can check for that in  
configure).  Plus, those systems won't have processor affinity, anyway.


Behind the scenes, Paul and I have been working on a standalone
library called Portable Linux Processor Affinity (PLPA) to handle all
this junk.  The SVN is hosted on svn.open-mpi.org -- we'll open
it up in a few days (i.e., after we adjust to the syscall()
interface).  This library will be released under the BSD license; it
a) is really pretty small, and b) most importantly, allows other
developers using Linux processor affinity to not worry about any of
these horrid details.  The PLPA will have its own web page and
mailing list, too.


Thanks for your diligence in pestering us about this!  :-)


On Dec 12, 2005, at 10:32 AM, Bogdan Costescu wrote:


On Fri, 9 Dec 2005, Paul H. Hargrove wrote:


If one looks through enough kernel versions,


In the meantime, I've gotten a copy of kernel/sched.c from a SGI Prism
kernel - I assume that it is the same used on Altix; this one has in
the Makefile EXTRAVERSION = -sgi306rp31. So again, all prototypes of
the sys_sched_setaffinity function that I've seen so far have 3
args... which means that no compiler tricks are needed to keep 3
different copies of the function.


one finds that some of them differ in what they will accept for the
len.


OK, so this is a different problem...


Some produce EINVAL if len!=sizeof(long),


I beg to disagree. All the codes that I looked at test for

len < sizeof(new_mask)

and copy user data based on the size of new_mask, so if "len" is
larger than sizeof(new_mask), no error occurs.


others (especially Altix) produce EINVAL if len is too short to
cover all the machine's CPUs.


...so IMHO this test should be used instead to separate a long from a
(larger) cpumask_t.

In the message that described your implementation you also wrote:


while on other kernels I find that a too-short mask is padded w/
zeros and no error results. So, we want a big value for len


Indeed some (more recent) kernels pad with zeros if "len" is too
short. But a "big value for len" is again wrong.

I can see 3 cases, again by looking at the kernel code and not dealing
with 2 vs. 3 args:

1. tests for len < sizeof(long) and copies only sizeof(long) if larger
(backported 2.4 in RHEL3); this can be identified by passing "len"
smaller than sizeof(long), which returns -EINVAL, and then passing "len"
of (or larger than) sizeof(long), which should not return an error.

2. tests for len < sizeof(cpumask_t) and copies only sizeof(cpumask_t) if
larger (backported 2.4 from SGI, 2.6.3 from Mandrake 10.0); this can
be identified by passing "len" shorter than sizeof(cpumask_t), which
returns -EINVAL, and then passing "len" of (or larger than)
sizeof(cpumask_t), which should not return an error.

3. tests for len < sizeof(cpumask_t) and pads with zeros if true,
otherwise copies only sizeof(cpumask_t) (2.6.9 in RHEL4 and 2.6.14).
This can't really be identified as it doesn't return -EINVAL in any
situation.

As you can see your suggestion to set "big value for len" would
successfully pass _all_ of the above conditions and would therefore
not offer any separation between the cases.

The stuff above applies to the _set function; the _get function is a
bit different:

1. tests for len < sizeof(long) and returns -EINVAL if true.
(backported 2.4 in RHEL3). This can be identified by passing "len"
smaller than sizeof(long) which returns -EINVAL and then passing "len"
of (or larger than) sizeof(long) which should not return error.

2. tests for len < sizeof(cpumask_t) and returns -EINVAL if true.
(backported 2.4 from SGI, 2.6.3 from Mandrake 10.0, 2.6.9 from RHEL4,
2.6.14). This can be identified by passing "len" smaller than
sizeof(cpumask_t) which returns -EINVAL and then passing "len" of (or
larger than) sizeof(cpumask_t) which should not return error.

Case 1. of _set is associated with case 1. of _get.
Cases 2. and 3. of _set are both associated with case 2. of _get.
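
The two _get cases listed above could be told apart with something like the sketch below. It is a simplification rather than anyone's actual implementation: the enum and function names are hypothetical, the 8192-byte "large" buffer is an arbitrary choice, and a kernel whose cpumask_t happens to be exactly one long is reported as case 1, since the two behave the same for the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() in <unistd.h> */
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

enum get_len_rule {
    GET_LEN_LONG,       /* case 1: len == sizeof(long) is accepted */
    GET_LEN_CPUMASK,    /* case 2: a longer, cpumask_t-sized len is needed */
    GET_LEN_UNKNOWN     /* neither probe behaved as expected */
};

enum get_len_rule classify_getaffinity(void)
{
    unsigned long mask[1024];   /* 8192 bytes on 64-bit; arbitrary bound */
    long rc;

    /* Step 1: does the kernel accept exactly one long? */
    rc = syscall(SYS_sched_getaffinity, 0L,
                 (unsigned long)sizeof(unsigned long), mask);
    if (rc >= 0) {
        return GET_LEN_LONG;
    }
    if (errno != EINVAL) {
        return GET_LEN_UNKNOWN;
    }

    /* Step 2: one long was too short; confirm a large len is accepted. */
    rc = syscall(SYS_sched_getaffinity, 0L,
                 (unsigned long)sizeof(mask), mask);
    return (rc >= 0) ? GET_LEN_CPUMASK : GET_LEN_UNKNOWN;
}
```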

So IMHO the test should be made with the _get function (as explained
in a 

Re: [O-MPI devel] Linux processor affinity

2005-12-09 Thread Paul H. Hargrove
Just recently finished checking.  For the collection of Linux hosts I 
have access to, the probe results are the same regardless of the choice 
of set or get.  I agree 100% that "get" is a safer probe.


-Paul

Jeff Squyres wrote:

On Dec 9, 2005, at 3:06 PM, Bogdan Costescu wrote:


 rc = sched_setaffinity(0, sizeof(mask), mask);

This changes whatever affinity might have been set before this check,
for example by a (smart, don't know if such exists now) batch system.
I haven't checked if it's possible, but I think that a similar
solution based on sched_getaffinity would be much better, as this
would not disturb the current settings.


Paul and I were discussing this earlier (off list).  He was  
investigating doing the same check with sched_getaffinity() -- I  
don't know if he has finished checking into that already.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [O-MPI devel] Linux processor affinity

2005-12-09 Thread Bogdan Costescu

On Thu, 8 Dec 2005, Jeff Squyres wrote:


This is friggen' amazing.


Let me disagree with you here... and not because I proposed a 
different solution. ;-)



 rc = sched_setaffinity(0, sizeof(mask), mask);


This changes whatever affinity might have been set before this check, 
for example by a (smart, don't know if such exists now) batch system. 
I haven't checked if it's possible, but I think that a similar 
solution based on sched_getaffinity would be much better, as this 
would not disturb the current settings.
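
To illustrate the concern: if a set-based probe were kept anyway, it would have to save and restore the caller's mask itself, roughly as sketched below. The function names and the 1024-byte buffer are hypothetical, example_probe is only a placeholder, and the sketch assumes the raw getaffinity syscall returns the number of bytes it wrote, as current kernels do. A get-based probe avoids this bookkeeping altogether.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() in <unistd.h> */
#endif
#include <sys/syscall.h>
#include <unistd.h>

#define SAVED_MASK_WORDS 128    /* 1024 bytes on 64-bit; arbitrary bound */

/* Placeholder for a set-based probe; a real one would call setaffinity. */
int example_probe(void)
{
    return 0;
}

/*
 * Run a probe that may change the affinity mask, then put back the mask
 * that was in effect beforehand.  Returns the probe's result, or -1 if the
 * mask could not be saved or restored.
 */
int run_probe_preserving_affinity(int (*probe)(void))
{
    unsigned long saved[SAVED_MASK_WORDS];
    long bytes;
    int result;

    bytes = syscall(SYS_sched_getaffinity, 0L,
                    (unsigned long)sizeof(saved), saved);
    if (bytes <= 0) {
        return -1;              /* could not read the current mask */
    }

    result = probe();

    /* Hand back exactly as many bytes as the kernel gave us. */
    if (syscall(SYS_sched_setaffinity, 0L, (unsigned long)bytes, saved) < 0) {
        return -1;
    }
    return result;
}
```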


--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [O-MPI devel] Linux processor affinity

2005-12-09 Thread Paul H. Hargrove
If one looks through enough kernel versions, one finds that some of them 
differ in what they will accept for the len.  Some produce EINVAL if 
len!=sizeof(long), others (especially Altix) produce EINVAL if len is 
too short to cover all the machine's CPUs.  I think I recall finding one 
that was even happy w/ len==0.  So, even if one ignores the 2-argument 
version in some 2.5.x kernels, the caller needs to know if the len to 
pass should always be sizeof(long), or if it should reflect the true 
number of CPUs present (as one must on an Altix).


-Paul

Bogdan Costescu wrote:

On Thu, 8 Dec 2005, Jeff Squyres wrote:

Check out
http://svn.open-mpi.org/svn/ompi/trunk/opal/mca/paffinity/linux/paffinity_linux.h
-- there's a big comment in that file about the problem, including
descriptions of the 3 APIs.


I'm sorry, but that is not quite what I wrote about in my message. The 
comments refer to the _glibc_ view of the functions, at least I 
couldn't see how they map to my reading of the _kernel_ source code.
Let's take one that is specifically mentioned there: Mandrake 10.0, 
kernel based on 2.6.3, in file kernel/sched.c there is the function:


/**
  * sys_sched_setaffinity - set the cpu affinity of a process
  * @pid: pid of the process
  * @len: length in bytes of the bitmask pointed to by user_mask_ptr
  * @user_mask_ptr: user-space pointer to the new cpu mask
  */
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
   unsigned long __user *user_mask_ptr)

which again has 3 arguments that look exactly like the ones that I 
mentioned previously. I don't have access to the source code of 
the SGI Altix kernel, so I can't check the other one mentioned there 
as a 2-args function. But so far all _kernel_ prototypes of the 
function that I have looked at are exactly the same with 3 arguments.


The solution that I proposed works much like a statically linked 
binary - it calls via a syscall the _kernel_ function that has a 
constant (so far) prototype. It doesn't call the _glibc_ function that 
changes prototype.





--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [O-MPI devel] Linux processor affinity

2005-12-09 Thread Bogdan Costescu

On Thu, 8 Dec 2005, Jeff Squyres wrote:

Check out
http://svn.open-mpi.org/svn/ompi/trunk/opal/mca/paffinity/linux/paffinity_linux.h
-- there's a big comment in that file about the problem, including
descriptions of the 3 APIs.


I'm sorry, but that is not quite what I wrote about in my message. The 
comments refer to the _glibc_ view of the functions, at least I 
couldn't see how they map to my reading of the _kernel_ source code.
Let's take one that is specifically mentioned there: Mandrake 10.0, 
kernel based on 2.6.3, in file kernel/sched.c there is the function:


/**
 * sys_sched_setaffinity - set the cpu affinity of a process
 * @pid: pid of the process
 * @len: length in bytes of the bitmask pointed to by user_mask_ptr
 * @user_mask_ptr: user-space pointer to the new cpu mask
 */
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
  unsigned long __user *user_mask_ptr)

which again has 3 arguments that look exactly like the ones that I 
mentioned previously. I don't have access to the source code of 
the SGI Altix kernel, so I can't check the other one mentioned there 
as a 2-args function. But so far all _kernel_ prototypes of the 
function that I have looked at are exactly the same with 3 arguments.


The solution that I proposed works much like a statically linked 
binary - it calls via a syscall the _kernel_ function that has a 
constant (so far) prototype. It doesn't call the _glibc_ function that 
changes prototype.
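
Calling the kernel function directly could be sketched as follows. The wrapper names (raw_sched_setaffinity, raw_sched_getaffinity) and the helper count_allowed_cpus are hypothetical and exist only for this illustration; the argument list mirrors the 3-argument kernel prototype quoted above, and the 1024-byte buffer is an arbitrary size assumed to be large enough.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() in <unistd.h> */
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrappers that bypass glibc and use the kernel's 3-argument form. */
long raw_sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask)
{
    return syscall(SYS_sched_setaffinity, (long)pid, (unsigned long)len, mask);
}

long raw_sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask)
{
    return syscall(SYS_sched_getaffinity, (long)pid, (unsigned long)len, mask);
}

/* Count the CPUs in the calling process's current mask, or -1 on error. */
int count_allowed_cpus(void)
{
    unsigned long mask[128] = { 0 };    /* 1024 bytes on 64-bit */
    int count = 0;
    size_t i;

    if (raw_sched_getaffinity(0, sizeof(mask), mask) < 0) {
        return -1;
    }
    for (i = 0; i < sizeof(mask) / sizeof(mask[0]); i++) {
        unsigned long word;
        for (word = mask[i]; word != 0; word &= word - 1) {
            count++;                    /* clears one set bit per pass */
        }
    }
    return count;
}
```

Since the wrappers never touch the glibc declarations, the same binary does not depend on which prototype the installed headers happen to provide.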


--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [O-MPI devel] Linux processor affinity

2005-12-08 Thread Jeff Squyres

On Nov 29, 2005, at 2:51 PM, Paul H. Hargrove wrote:


The result is the following, which I've tried in limited testing:


Holy Crimminey, Batman -- this message slipped by me in my INBOX.

This is friggen' amazing.

Many thanks, Paul!


#include <errno.h>
#include <string.h>

enum {
    SCHED_SETAFFINITY_TAKES_2_ARGS,
    SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG,
    SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET,
    SCHED_SETAFFINITY_UNKNOWN
};

/* We want to call by this prototype, even if it is not the real one,
   so <sched.h> (which may declare a different one) is not included. */
extern int sched_setaffinity(int pid, unsigned int len, void *mask);

int probe_setaffinity(void) {
    unsigned long mask[511];
    int rc;

    memset(mask, 0, sizeof(mask));
    mask[0] = 1;
    rc = sched_setaffinity(0, sizeof(mask), mask);

    if (rc >= 0) {
        /* Kernel truncates over-length masks -> successful call */
        return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET;
    } else if (errno == EINVAL) {
        /* Kernel returns EINVAL when len != sizeof(long) */
        return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG;
    } else if (errno == EFAULT) {
        /* Kernel returns EFAULT having rejected len as an address */
        return SCHED_SETAFFINITY_TAKES_2_ARGS;
    }
    return SCHED_SETAFFINITY_UNKNOWN;
}


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/





Re: [O-MPI devel] Linux processor affinity

2005-11-29 Thread Paul H. Hargrove

Eureka!

Operationally the 3-argument variants are ALMOST identical.  The older 
version required len == sizeof(long), while the later version allowed 
the len to vary (so an Altix could have more than 64 cpus).  However, in 
the kernel both effectively treat the 3rd argument as an array of 
unsigned longs.  It appears that with the later kernel interface, both 
"cpu_set_t*" and "unsigned long *" have been used by glibc.  So, as long 
as the kernel isn't enforcing len==sizeof(long), the cpu_set_t can be 
used w/ any 3-argument kernel regardless of what the library headers say.


Looking at the kernel code for various implementations of the 3-arg 
version shows that it can be tough to know which from a static test.  On 
the Altix (using cpu_set_t) one gets errno=EFAULT if the len is too 
short to cover all the online cpus, while on other kernels I find that a 
too-short mask is padded w/ zeros and no error results.  So, we want a 
big value for len.  However, since the 2-arg version treats the 2nd arg 
as an address rather than a len, we can use a len<4096 to ensure an 
invalid address will result in errno=EFAULT.


The result is the following, which I've tried in limited testing:


#include <errno.h>
#include <string.h>

enum {
    SCHED_SETAFFINITY_TAKES_2_ARGS,
    SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG,
    SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET,
    SCHED_SETAFFINITY_UNKNOWN
};

/* We want to call by this prototype, even if it is not the real one,
   so <sched.h> (which may declare a different one) is not included. */
extern int sched_setaffinity(int pid, unsigned int len, void *mask);

int probe_setaffinity(void) {
    unsigned long mask[511];
    int rc;

    memset(mask, 0, sizeof(mask));
    mask[0] = 1;
    rc = sched_setaffinity(0, sizeof(mask), mask);

    if (rc >= 0) {
        /* Kernel truncates over-length masks -> successful call */
        return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET;
    } else if (errno == EINVAL) {
        /* Kernel returns EINVAL when len != sizeof(long) */
        return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG;
    } else if (errno == EFAULT) {
        /* Kernel returns EFAULT having rejected len as an address */
        return SCHED_SETAFFINITY_TAKES_2_ARGS;
    }
    return SCHED_SETAFFINITY_UNKNOWN;
}



Jeff Squyres wrote:
Greetings all.  I'm writing this to ask for help from the general 
development community.  We've run into a problem with Linux processor 
affinity, and although I've individually talked to a lot of people 
about this, no one has been able to come up with a solution.  So I 
thought I'd open this to a wider audience.


This is a long-ish e-mail; bear with me.

As you may or may not know, Open MPI includes support for processor and 
memory affinity.  There are a number of benefits, but I'll skip that 
discussion for now.  For more information, see the following:


http://www.open-mpi.org/faq/?category=building#build-paffinity
http://www.open-mpi.org/faq/?category=building#build-maffinity
http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#maffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Here's the problem: there are 3 different APIs for processor affinity 
in Linux.  I have not done exhaustive research on this, but which API 
you have seems to depend on your version of kernel, glibc, and/or Linux 
vendor (i.e., some vendors appear to port different versions of the API 
to their particular kernel/glibc).  The issue is that all 3 versions of 
the API use the same function names (sched_setaffinity() and 
sched_getaffinity()), but they change the number and types of the 
parameters to these functions.


This is not a big problem for source distributions of Open MPI -- our 
configure script figures out which one you have and uses preprocessor 
directives to select the Right stuff in our code base for your 
platform.


What *is* a big problem, however, is that ISVs can therefore not ship a 
binary Open MPI installation and reasonably expect the processor 
affinity aspects of it to work on multiple Linux platforms.  That is, 
if the ISV compiles for API #X and ships a binary to a system that has 
API #Y, there are two options:


1. Processor affinity is disabled.  This means that the benefits of 
processor affinity won't be visible (not hugely important on 2-way 
SMPs, but as the number of processors/cores increases, this is going to 
become more important), and Open MPI's NUMA-aware collectives won't be 
able to be used (because memory affinity may not be useful without 
processor affinity guarantees).


2. Processor affinity is enabled, but the code invokes API #X on a 
system with API #Y.  This will have unpredictable results, the best 
case of which will be that processor affinity is simply [effectively] 
ignored; the worst case of which will be that the application will fail 
(e.g., seg fault).


Clearly, neither of these options is attractive.

My question to the developer crowd out there -- can you think of a way 
around this?  More specifically, is 

Re: [O-MPI devel] Linux processor affinity

2005-11-29 Thread Paul H. Hargrove

Jeff, et al.,

  My own "research" into processor affinity for the GASNet runtime 
began by "borrowing" the related autoconf code from OpenMPI.  My 
experience is the same as Jeff's when it comes to looking for a 
correlation between the API and any system parameter such as libc or 
kernel version: not an exhaustive search, but enough to see that there 
is no simple mapping.
  While far from "ideal", one option might be to perform an 
installation-time probe w/ a dumbed down version of the autoconf probes 
used at build time.  This probe would then set the proper processor 
affinity setting in a config file, an env var in the ISV's wrapper 
around mpirun, or similar place.  One can then have processor affinity 
disabled if no setting is found and use the one selected at install time 
if the setting is found.
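
The run-time half of that idea might look like the sketch below: read whatever the install-time probe recorded and fall back to "disabled" when nothing usable is found. Everything named here is an assumption made for illustration, including the environment variable EXAMPLE_PAFFINITY_API and the setting strings; none of it is an existing Open MPI or GASNet interface.

```c
#include <stdlib.h>
#include <string.h>

enum paffinity_api {
    PAFFINITY_DISABLED,             /* no setting found: affinity is off */
    PAFFINITY_2_ARGS,
    PAFFINITY_3_ARGS_LONG,
    PAFFINITY_3_ARGS_CPU_SET
};

/* Map the text written by an install-time probe to an API choice. */
enum paffinity_api paffinity_api_from_setting(const char *value)
{
    if (value == NULL) {
        return PAFFINITY_DISABLED;
    }
    if (strcmp(value, "2args") == 0) {
        return PAFFINITY_2_ARGS;
    }
    if (strcmp(value, "3args-long") == 0) {
        return PAFFINITY_3_ARGS_LONG;
    }
    if (strcmp(value, "3args-cpuset") == 0) {
        return PAFFINITY_3_ARGS_CPU_SET;
    }
    return PAFFINITY_DISABLED;      /* unrecognized text: stay safe */
}

/* Look the setting up in the environment (hypothetical variable name). */
enum paffinity_api paffinity_api_from_env(void)
{
    return paffinity_api_from_setting(getenv("EXAMPLE_PAFFINITY_API"));
}
```

Treating a missing or unrecognized value as "disabled" keeps a mis-installed binary from calling the wrong API variant.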


-Paul

Jeff Squyres wrote:
Greetings all.  I'm writing this to ask for help from the general 
development community.  We've run into a problem with Linux processor 
affinity, and although I've individually talked to a lot of people 
about this, no one has been able to come up with a solution.  So I 
thought I'd open this to a wider audience.


This is a long-ish e-mail; bear with me.

As you may or may not know, Open MPI includes support for processor and 
memory affinity.  There are a number of benefits, but I'll skip that 
discussion for now.  For more information, see the following:


http://www.open-mpi.org/faq/?category=building#build-paffinity
http://www.open-mpi.org/faq/?category=building#build-maffinity
http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#maffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

Here's the problem: there are 3 different APIs for processor affinity 
in Linux.  I have not done exhaustive research on this, but which API 
you have seems to depend on your version of kernel, glibc, and/or Linux 
vendor (i.e., some vendors appear to port different versions of the API 
to their particular kernel/glibc).  The issue is that all 3 versions of 
the API use the same function names (sched_setaffinity() and 
sched_getaffinity()), but they change the number and types of the 
parameters to these functions.


This is not a big problem for source distributions of Open MPI -- our 
configure script figures out which one you have and uses preprocessor 
directives to select the Right stuff in our code base for your 
platform.


What *is* a big problem, however, is that ISVs can therefore not ship a 
binary Open MPI installation and reasonably expect the processor 
affinity aspects of it to work on multiple Linux platforms.  That is, 
if the ISV compiles for API #X and ships a binary to a system that has 
API #Y, there are two options:


1. Processor affinity is disabled.  This means that the benefits of 
processor affinity won't be visible (not hugely important on 2-way 
SMPs, but as the number of processors/cores increases, this is going to 
become more important), and Open MPI's NUMA-aware collectives won't be 
able to be used (because memory affinity may not be useful without 
processor affinity guarantees).


2. Processor affinity is enabled, but the code invokes API #X on a 
system with API #Y.  This will have unpredictable results, the best 
case of which will be that processor affinity is simply [effectively] 
ignored; the worst case of which will be that the application will fail 
(e.g., seg fault).


Clearly, neither of these options is attractive.

My question to the developer crowd out there -- can you think of a way 
around this?  More specifically, is there a way to know -- at run time 
-- which API to use?  We can do some compiler trickery to compile all 
three APIs into a single Open MPI installation and then run-time 
dispatch to the Right one, but this is contingent upon being able to 
determine which API to dispatch to.  A bunch of us have poked around
(e.g., looked in /proc and /sys) but have not found anything on the
system that indicates which API you have.
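
The run-time dispatch mentioned above could be structured as a table of function pointers, one per compiled-in API variant. The sketch below is hypothetical throughout: the names are invented, and the three bind functions are stubs that only return their own tag so the dispatch can be observed, where real variants would call the matching form of sched_setaffinity(). How to pick the index is exactly the open question.

```c
#include <stddef.h>

enum affinity_api {
    AFFINITY_API_2_ARGS,
    AFFINITY_API_3_ARGS_LONG,
    AFFINITY_API_3_ARGS_CPU_SET,
    AFFINITY_API_COUNT              /* number of variants, not a variant */
};

typedef int (*bind_fn)(int cpu);

/* Stubs standing in for the three variants compiled into one binary. */
int bind_with_2_args(int cpu)          { (void)cpu; return AFFINITY_API_2_ARGS; }
int bind_with_3_args_long(int cpu)     { (void)cpu; return AFFINITY_API_3_ARGS_LONG; }
int bind_with_3_args_cpu_set(int cpu)  { (void)cpu; return AFFINITY_API_3_ARGS_CPU_SET; }

static const bind_fn bind_table[AFFINITY_API_COUNT] = {
    [AFFINITY_API_2_ARGS]         = bind_with_2_args,
    [AFFINITY_API_3_ARGS_LONG]    = bind_with_3_args_long,
    [AFFINITY_API_3_ARGS_CPU_SET] = bind_with_3_args_cpu_set
};

/* Dispatch to the variant chosen at run time; -1 if the choice is invalid. */
int bind_to_cpu(enum affinity_api api, int cpu)
{
    if ((int)api < 0 || (int)api >= AFFINITY_API_COUNT) {
        return -1;
    }
    return bind_table[api](cpu);
}
```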


Does anyone have any suggestions here?

Many thanks for your time.




--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900