Re: [O-MPI devel] Linux processor affinity
On Dec 13, 2005, at 1:45 PM, Jeff Squyres wrote:

> PLPA should be available Real Soon Now.

We have released v0.9 of the Portable Linux Processor Affinity (PLPA -- pronounced "pli-pa") project, a standalone library that hides all the muckety-muck of processor affinity that we have been discussing on this list for the past few weeks.

This version does not pretend to be a stable release yet -- although it's quite small and we've done a bunch of testing with it, it's now time to open this project up to the community for more wide-spread testing and real-world feedback.

The PLPA has its own web pages and mailing lists -- let's move all discussion over to those lists. See the PLPA home page for more details:

http://www.open-mpi.org/software/plpa/

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI devel] Linux processor affinity
On Dec 13, 2005, at 12:30 PM, Bogdan Costescu wrote:

>> In short -- we ended up agreeing with you. :-)
>
> Whew, I'm surprised given the enthusiasm which you showed when seeing
> Paul's code ! Really, I thought that you will choose Paul's code with
> only the conditions changed as expressed in my last e-mail... as to
> give satisfaction to both of us :-)

Nah -- my enthusiasm was more geared towards the fact that there *was* a solution available (I was pretty convinced that there was not -- my appeal to the mailing list was a last ditch effort).

> But I would like to stress the fact that my conclusions came out only
> from reading the kernel code; I only did some tests on the kernels
> that I have running at the moment (RHEL3 and RHEL4 on i386). So some
> real-world testing is still needed, especially on the kernels that
> were "different".

Absolutely. Paul has looked at a bunch of kernels and verified our tests, and we've run them on a variety of machines and it seems to be working.

PLPA should be available Real Soon Now.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI devel] Linux processor affinity
On Mon, 12 Dec 2005, Jeff Squyres wrote:

> In short -- we ended up agreeing with you. :-)

Whew, I'm surprised, given the enthusiasm you showed when you saw Paul's code! Really, I thought that you would choose Paul's code with only the conditions changed as expressed in my last e-mail... so as to satisfy both of us :-)

But I would like to stress that my conclusions came only from reading the kernel code; I only ran some tests on the kernels that I have running at the moment (RHEL3 and RHEL4 on i386). So some real-world testing is still needed, especially on the kernels that were "different".

> As a personal sidenote -- it sucks that we all had to do this much
> research to figure this out.

Well, I consider this to be a result of applying commercial interest to Linux. The distributions should not have backported those changes until a final, stable API was established, but when you want to be the first to claim "I support this and that"...

> Thanks for your diligence in pestering us about this! :-)

Eh, don't mention it! I want Open MPI to work :-)

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [O-MPI devel] Linux processor affinity
To make this significantly easier, I called Paul and we discussed this at length.

In short -- we ended up agreeing with you. :-)

As a personal sidenote -- it sucks that we all had to do this much research to figure this out.

In particular, we missed the fact that all the kernel versions take 3 arguments (we thought that some took 2), and that's where some of the reasons for the initial approach came from. So we'll implement this as a syscall() and use the getaffinity syscall to probe for the correct length (some kernels require <= sizeof(long), some require == sizeof(long), and some are ok with >= sizeof(long)).

Using syscall() cuts out the potentially-buggy middleman (glibc) and removes a layer of indirection. What glibc does can *usually* be deduced, but there's little reason not to use syscall() directly. There are some older systems out there that do not have syscall(), but I don't think we care about them (i.e., we can check for that in configure). Plus, those systems won't have processor affinity, anyway.

Behind the scenes, Paul and I have been working on a standalone library to handle all this junk, called Portable Linux Processor Affinity (PLPA). The SVN is hosted on svn.open-mpi.org -- we'll open it up in a few days (i.e., after we adjust to the syscall() interface). This library will be released under the BSD license and a) is really pretty small, and b) most importantly, allows other developers using Linux processor affinity to not worry about any of these horrid details. The PLPA will have its own web page and mailing list, too.

Thanks for your diligence in pestering us about this! :-)
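A minimal sketch of the length probe described in this message, assuming a Linux system where syscall() and SYS_sched_getaffinity are available. The function name probe_affinity_len and the 4096-byte cap are made up for illustration; this is not the PLPA code. It only ever calls the "get" syscall, so the caller's affinity is left alone.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Largest mask this sketch is willing to try, in bytes (arbitrary cap). */
#define PROBE_MAX_BYTES 4096

/*
 * Ask the kernel for the caller's affinity mask with growing buffer
 * lengths until it stops answering EINVAL. Returns the first accepted
 * length in bytes, or 0 if no length worked (treat affinity as unusable).
 */
size_t probe_affinity_len(void)
{
    unsigned long buf[PROBE_MAX_BYTES / sizeof(unsigned long)];
    size_t len;

    for (len = sizeof(unsigned long); len <= sizeof(buf); len *= 2) {
        long rc = syscall(SYS_sched_getaffinity, 0, len, buf);
        if (rc >= 0) {
            /* Every length we try is a whole number of longs. */
            assert(len % sizeof(unsigned long) == 0);
            return len;
        }
        if (errno != EINVAL) {
            break; /* ENOSYS, EFAULT, ...: a longer mask will not help */
        }
    }
    return 0;
}
```

A caller would run this once at start-up and reuse the returned length for every later raw get/set call.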
Re: [O-MPI devel] Linux processor affinity
On Fri, 9 Dec 2005, Paul H. Hargrove wrote:

> If one looks through enough kernel versions,

In the meantime, I've gotten a copy of kernel/sched.c from a SGI Prism kernel - I assume that it is the same one used on Altix; this one has in the Makefile EXTRAVERSION = -sgi306rp31. So again, all prototypes of the sys_sched_setaffinity function that I've seen so far have 3 args... which means that no compiler tricks are needed to keep 3 different copies of the function.

> one finds that some of them differ in what they will accept for the len.

OK, so this is a different problem...

> Some produce EINVAL if len!=sizeof(long),

I beg to disagree. All the code that I looked at tests for len < sizeof(new_mask) and copies user data based on the size of new_mask, so if "len" is larger than sizeof(new_mask), no error occurs.

> others (especially Altix) produce EINVAL if len is too short to cover
> all the machine's CPUs.

...so IMHO this test should be used instead to separate a long from a (larger) cpumask_t.

In the message that described your implementation you also wrote:

> while on other kernels I find that a too-short mask is padded w/ zeros
> and no error results. So, we want a big value for len

Indeed some (more recent) kernels pad with zeros if "len" is too short. But a "big value for len" is again wrong. I can see 3 cases, again by looking at the kernel code and not dealing with 2 vs. 3 args:

1. tests for len < sizeof(long) and copies only sizeof(long) bytes if len is larger (backported 2.4 in RHEL3); this can be identified by passing "len" smaller than sizeof(long), which returns -EINVAL, and then passing "len" of (or larger than) sizeof(long), which should not return an error.

2. tests for len < sizeof(cpumask_t) and copies only sizeof(cpumask_t) bytes if len is larger (backported 2.4 from SGI, 2.6.3 from Mandrake 10.0); this can be identified by passing "len" shorter than sizeof(cpumask_t), which returns -EINVAL, and then passing "len" of (or larger than) sizeof(cpumask_t), which should not return an error.

3. tests for len < sizeof(cpumask_t) and pads with zeros if true, otherwise copies only sizeof(cpumask_t) bytes (2.6.9 in RHEL4 and 2.6.14). This can't really be identified, as it doesn't return -EINVAL in any situation.

As you can see, your suggestion to set a "big value for len" would successfully pass _all_ of the above conditions and would therefore not offer any separation between the cases.

The stuff above applies to the _set function; the _get function is a bit different:

1. tests for len < sizeof(long) and returns -EINVAL if true (backported 2.4 in RHEL3). This can be identified by passing "len" smaller than sizeof(long), which returns -EINVAL, and then passing "len" of (or larger than) sizeof(long), which should not return an error.

2. tests for len < sizeof(cpumask_t) and returns -EINVAL if true (backported 2.4 from SGI, 2.6.3 from Mandrake 10.0, 2.6.9 from RHEL4, 2.6.14). This can be identified by passing "len" smaller than sizeof(cpumask_t), which returns -EINVAL, and then passing "len" of (or larger than) sizeof(cpumask_t), which should not return an error.

Case 1 of _set corresponds to case 1 of _get. Cases 2 and 3 of _set both correspond to case 2 of _get. So IMHO the test should be made with the _get function (as explained in a previous message), by setting len=sizeof(long), which would allow case 1 to work fine, while case 2 would return -EINVAL, exactly opposite from the code that you proposed.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
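The case analysis above can be modelled without touching a real kernel. The sketch below is an illustration only: the functions are hypothetical stand-ins for each kernel's length check, and MODEL_CPUMASK_BYTES is an assumed sizeof(cpumask_t) for a large machine. It shows that a _get probe with len = sizeof(long) separates the two _get cases, while a big len is accepted by all three _set cases and so separates nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed sizeof(cpumask_t) on a large machine (illustrative value). */
#define MODEL_CPUMASK_BYTES 128

static_assert(MODEL_CPUMASK_BYTES > sizeof(long),
              "the model assumes cpumask_t is larger than a long");

/* A modelled length check: returns 0 or -EINVAL, as the kernel would. */
typedef int (*len_check_fn)(size_t len);

/* _get case 1 (backported 2.4 in RHEL3). */
int get_case1(size_t len) { return len < sizeof(long) ? -EINVAL : 0; }
/* _get case 2 (SGI 2.4, Mandrake 2.6.3, RHEL4 2.6.9, 2.6.14). */
int get_case2(size_t len) { return len < MODEL_CPUMASK_BYTES ? -EINVAL : 0; }

/* _set cases 1 to 3; case 3 pads short masks and never returns -EINVAL. */
int set_case1(size_t len) { return len < sizeof(long) ? -EINVAL : 0; }
int set_case2(size_t len) { return len < MODEL_CPUMASK_BYTES ? -EINVAL : 0; }
int set_case3(size_t len) { (void)len; return 0; }

enum mask_kind { MASK_IS_LONG, MASK_IS_CPUMASK };

/* Probe _get with len = sizeof(long): success means a long-sized mask. */
enum mask_kind classify_by_get(len_check_fn get)
{
    return get(sizeof(long)) == 0 ? MASK_IS_LONG : MASK_IS_CPUMASK;
}

/* Returns 1 when every _set model accepts big_len, i.e. no separation. */
int big_len_accepted_by_all_set_models(size_t big_len)
{
    return set_case1(big_len) == 0 &&
           set_case2(big_len) == 0 &&
           set_case3(big_len) == 0;
}
```

If sizeof(cpumask_t) happens to equal sizeof(long), the two _get models behave identically, which is harmless because a long-sized mask is then correct for both.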
Re: [O-MPI devel] Linux processor affinity
Just recently finished checking. For the collection of Linux hosts I have access to, the probe results are the same regardless of the choice of set or get. I agree 100% that "get" is a safer probe.

-Paul

Jeff Squyres wrote:

> On Dec 9, 2005, at 3:06 PM, Bogdan Costescu wrote:
>
>>> rc = sched_setaffinity(0, sizeof(mask), mask);
>>
>> This changes whatever affinity might have been set before this check,
>> for example by a (smart, don't know if such exists now) batch system.
>> I haven't checked if it's possible, but I think that a similar
>> solution based on sched_getaffinity would be much better, as this
>> would not disturb the current settings.
>
> Paul and I were discussing this earlier (off list). He was
> investigating doing the same check with sched_getaffinity() -- I don't
> know if he has finished checking into that already.
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [O-MPI devel] Linux processor affinity
On Dec 9, 2005, at 3:06 PM, Bogdan Costescu wrote: rc = sched_setaffinity(0, sizeof(mask), mask); This changes whatever affinity might have been set before this check, for example by a (smart, don't know if such exists now) batch system. I haven't checked if it's possible, but I think that a similar solution based on sched_getaffinity would be much better, as this would not disturb the current settings. Paul and I were discussing this earlier (off list). He was investigating doing the same check with sched_getaffinity() -- I don't know if he has finished checking into that already. -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/
Re: [O-MPI devel] Linux processor affinity
On Thu, 8 Dec 2005, Jeff Squyres wrote:

> This is friggen' amazing.

Let me disagree with you here... and not because I proposed a different solution. ;-)

>> rc = sched_setaffinity(0, sizeof(mask), mask);

This changes whatever affinity might have been set before this check, for example by a (smart, don't know if such exists now) batch system. I haven't checked if it's possible, but I think that a similar solution based on sched_getaffinity would be much better, as this would not disturb the current settings.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
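To make the concern concrete, here is a hedged sketch of what a set-based probe would have to do to avoid clobbering an affinity chosen by a batch system: save the mask first and restore it afterwards. It assumes a current glibc with the cpu_set_t interface (not the varying prototypes discussed in this thread); with_saved_affinity and destructive_probe are hypothetical names. A get-based probe needs none of this, which is the point being made above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Run a probe that may change the caller's affinity, then put the original
 * mask back. Returns the probe's result, or -1 if the original mask could
 * not be saved (in which case the probe is not run at all).
 */
int with_saved_affinity(int (*probe)(void))
{
    cpu_set_t saved;
    int result;

    assert(probe != NULL);
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0) {
        return -1;
    }
    result = probe();
    /* Best-effort restore of whatever was in place before the probe. */
    (void)sched_setaffinity(0, sizeof(saved), &saved);
    return result;
}

/* Example of a disturbing probe: pins the caller to its first allowed CPU. */
int destructive_probe(void)
{
    cpu_set_t current, one;
    int cpu;

    if (sched_getaffinity(0, sizeof(current), &current) != 0) {
        return -1;
    }
    CPU_ZERO(&one);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &current)) {
            CPU_SET(cpu, &one);
            break;
        }
    }
    return sched_setaffinity(0, sizeof(one), &one);
}
```

Calling with_saved_affinity(destructive_probe) leaves the process with the same mask it started with, at the cost of a short window in which the mask is wrong.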
Re: [O-MPI devel] Linux processor affinity
If one looks though enough kernel versions, one finds that some of them differ in what they will accept for the len. Some produce EINVAL if len!=sizeof(long), others (especially Altix) produce EINVAL if len is too short to cover all the machine's CPUs. I think I recall finding one that was even happy w/ len==0. So, even if one ignores the 2-argument version in some 2.5.x kernels, the caller needs to know if the len to pass should always be sizeof(long), or if it should reflect the true number of CPUs present (as one must on an Altix). -Paul Bogdan Costescu wrote: On Thu, 8 Dec 2005, Jeff Squyres wrote: Check out http://svn.open-mpi.org/svn/ompi/trunk/opal/mca/paffinity/ linux/paffinity_linux.h -- there's a big comment in that file about the problem, to include descriptions of the 3 APIs. I'm sorry, but that is not quite what I wrote about in my message. The comments refer to the _glibc_ view of the functions, at least I couldn't see how they map to my reading of the _kernel_ source code. Let's take one that is specifically mentioned there: Mandrake 10.0, kernel based on 2.6.3, in file kernel/sched.c there is the function: /** * sys_sched_setaffinity - set the cpu affinity of a process * @pid: pid of the process * @len: length in bytes of the bitmask pointed to by user_mask_ptr * @user_mask_ptr: user-space pointer to the new cpu mask */ asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len, unsigned long __user *user_mask_ptr) which again has 3 arguments that look exactly like the ones that I mentioned previously. I don't have access to the source code of the SGI Altix kernel, so I can't check the other one mentioned there as a 2-args function. But so far all _kernel_ prototypes of the function that I have looked at are exactly the same with 3 arguments. The solution that I proposed works much like a statically linked binary - it calls via a syscall the _kernel_ function that has a constant (so far) prototype. 
It doesn't call the _glibc_ function that changes prototype. -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group HPC Research Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [O-MPI devel] Linux processor affinity
On Thu, 8 Dec 2005, Jeff Squyres wrote:

> Check out
> http://svn.open-mpi.org/svn/ompi/trunk/opal/mca/paffinity/linux/paffinity_linux.h
> -- there's a big comment in that file about the problem, to include
> descriptions of the 3 APIs.

I'm sorry, but that is not quite what I wrote about in my message. The comments refer to the _glibc_ view of the functions, at least I couldn't see how they map to my reading of the _kernel_ source code. Let's take one that is specifically mentioned there: Mandrake 10.0, kernel based on 2.6.3, in file kernel/sched.c there is the function:

    /**
     * sys_sched_setaffinity - set the cpu affinity of a process
     * @pid: pid of the process
     * @len: length in bytes of the bitmask pointed to by user_mask_ptr
     * @user_mask_ptr: user-space pointer to the new cpu mask
     */
    asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
                                          unsigned long __user *user_mask_ptr)

which again has 3 arguments that look exactly like the ones that I mentioned previously. I don't have access to the source code of the SGI Altix kernel, so I can't check the other one mentioned there as a 2-args function. But so far all _kernel_ prototypes of the function that I have looked at are exactly the same with 3 arguments.

The solution that I proposed works much like a statically linked binary - it calls via a syscall the _kernel_ function that has a constant (so far) prototype. It doesn't call the _glibc_ function that changes prototype.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [O-MPI devel] Linux processor affinity
On Nov 29, 2005, at 3:04 PM, Bogdan Costescu wrote: Here's the problem: there are 3 different APIs for processor affinity in Linux. Could you please list them (at least the ones that you know about) ? Check out http://svn.open-mpi.org/svn/ompi/trunk/opal/mca/paffinity/ linux/paffinity_linux.h -- there's a big comment in that file about the problem, to include descriptions of the 3 APIs. In the kernel source, in kernel/sched.c, the sys_sched_setaffinity function appears only in 2.6.0 (talking about stable kernels only). I can also see it back-ported by Red Hat in their RHEL3 (2.4.21-based) kernels, so I would like to know if others have back-ported it as well and if their functions differ. Both the official 2.6.x and the Red Hat back-ported definition of this function is: asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len, unsigned long __user *user_mask_ptr) (the back-ported RHEL3 doesn't have the __user attribute to the last parameter, but that's cosmetic) The glibc definitions of sched_setaffinity seem to change, I already found 2 of them in RHEL3 and RHEL4, but they both call the same underlying kernel function. So Open MPI could just bypass glibc and call the kernel function directly, for example: The problem is that there are some 2-parameter variants out there. :-( Check out Paul Hargrove's solution: http://www.open-mpi.org/community/ lists/devel/2005/11/0562.php -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/
Re: [O-MPI devel] Linux processor affinity
On Nov 29, 2005, at 2:51 PM, Paul H. Hargrove wrote:

> The result is the following, which I've tried in limited testing:

Holy Crimminey, Batman -- this message slipped by me in my INBOX.

This is friggen' amazing. Many thanks, Paul!

> #include <errno.h>
> #include <string.h>
>
> enum {
>     SCHED_SETAFFINITY_TAKES_2_ARGS,
>     SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG,
>     SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET,
>     SCHED_SETAFFINITY_UNKNOWN
> };
>
> /* We want to call by this prototype, even if it is not the real one */
> extern int sched_setaffinity(int pid, unsigned int len, void *mask);
>
> int probe_setaffinity(void) {
>     unsigned long mask[511];
>     int rc;
>
>     memset(mask, 0, sizeof(mask));
>     mask[0] = 1;
>     rc = sched_setaffinity(0, sizeof(mask), mask);
>     if (rc >= 0) {
>         /* Kernel truncates over-length masks -> successful call */
>         return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET;
>     } else if (errno == EINVAL) {
>         /* Kernel returns EINVAL when len != sizeof(long) */
>         return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG;
>     } else if (errno == EFAULT) {
>         /* Kernel returns EFAULT having rejected len as an address */
>         return SCHED_SETAFFINITY_TAKES_2_ARGS;
>     }
>     return SCHED_SETAFFINITY_UNKNOWN;
> }

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI devel] Linux processor affinity
On Tue, 29 Nov 2005, Jeff Squyres wrote:

> Here's the problem: there are 3 different APIs for processor affinity
> in Linux.

Could you please list them (at least the ones that you know about)?

In the kernel source, in kernel/sched.c, the sys_sched_setaffinity function appears only in 2.6.0 (talking about stable kernels only). I can also see it back-ported by Red Hat in their RHEL3 (2.4.21-based) kernels, so I would like to know if others have back-ported it as well and if their functions differ. In both the official 2.6.x and the Red Hat back-ported kernels, the definition of this function is:

    asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
                                          unsigned long __user *user_mask_ptr)

(the back-ported RHEL3 doesn't have the __user attribute on the last parameter, but that's cosmetic)

The glibc definitions of sched_setaffinity seem to change; I already found 2 of them in RHEL3 and RHEL4, but they both call the same underlying kernel function. So Open MPI could just bypass glibc and call the kernel function directly, for example:

    #include <stdio.h>          /* perror() */
    #include <errno.h>          /* errno, set by the _syscall3 stub */
    #include <sys/types.h>      /* pid_t */
    #include <linux/unistd.h>   /* _syscall3, __NR_sched_setaffinity */

    /* Generate a stub that traps straight into the kernel */
    _syscall3(int, sched_setaffinity, pid_t, pid, unsigned int, len,
              unsigned long *, user_mask_ptr)

    int main(int argc, char **argv) {
        unsigned long cpus = 1;   /* bit 0 set: bind to CPU 0 */
        int r;

        r = sched_setaffinity(0, sizeof(cpus), &cpus);
        if (r == -1) {
            perror("sched_setaffinity");
        }
        return 0;
    }

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
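The _syscall3 macro used in this message was later dropped from the kernel headers exported to user space, so the example may not build on a current system. A rough equivalent using syscall(2) is sketched below; the wrapper names raw_sched_setaffinity, raw_sched_getaffinity and count_allowed_cpus are illustrative, and the 1024-byte buffer is an arbitrary size chosen for the sketch. The demonstration only reads the mask, so it does not disturb the caller's affinity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same three arguments as the kernel's sys_sched_setaffinity(). */
long raw_sched_setaffinity(pid_t pid, unsigned int len,
                           unsigned long *user_mask_ptr)
{
    return syscall(SYS_sched_setaffinity, pid, len, user_mask_ptr);
}

/* The matching "get" call; on success the raw syscall returns the number
 * of bytes the kernel wrote into the mask. */
long raw_sched_getaffinity(pid_t pid, unsigned int len,
                           unsigned long *user_mask_ptr)
{
    return syscall(SYS_sched_getaffinity, pid, len, user_mask_ptr);
}

/* Count the CPUs the caller may run on, or return -1 if the call fails. */
int count_allowed_cpus(void)
{
    unsigned long mask[1024 / sizeof(unsigned long)];
    long got = raw_sched_getaffinity(0, sizeof(mask), mask);
    int count = 0;
    long i;

    if (got < 0) {
        return -1;
    }
    /* The kernel copies whole longs, so this division is exact. */
    assert(got % (long)sizeof(unsigned long) == 0);
    for (i = 0; i < got / (long)sizeof(unsigned long); i++) {
        unsigned long word = mask[i];
        while (word != 0) {
            count += (int)(word & 1UL);
            word >>= 1;
        }
    }
    return count;
}
```

Because the wrappers take the kernel's argument list rather than glibc's, they are unaffected by which sched_setaffinity() prototype the installed glibc headers declare.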
Re: [O-MPI devel] Linux processor affinity
Eureka!

Operationally the 3-argument variants are ALMOST identical. The older version required len == sizeof(long), while the later version allowed the len to vary (so an Altix could have more than 64 cpus). However, in the kernel both effectively treat the 3rd argument as an array of unsigned longs. It appears that with the later kernel interface, both "cpu_set_t*" and "unsigned long *" have been used by glibc. So, as long as the kernel isn't enforcing len==sizeof(long), the cpu_set_t can be used w/ any 3-argument kernel regardless of what the library headers say.

Looking at the kernel code for various implementations of the 3-arg version shows that it can be tough to know which one you have from a static test. On the Altix (using cpu_set_t) one gets errno=EINVAL if the len is too short to cover all the online cpus, while on other kernels I find that a too-short mask is padded w/ zeros and no error results. So, we want a big value for len. However, since the 2-arg version treats the 2nd arg as an address rather than a len, we can use a len<4096 to ensure an invalid address will result in errno=EFAULT.

The result is the following, which I've tried in limited testing:

    #include <errno.h>
    #include <string.h>

    enum {
        SCHED_SETAFFINITY_TAKES_2_ARGS,
        SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG,
        SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET,
        SCHED_SETAFFINITY_UNKNOWN
    };

    /* We want to call by this prototype, even if it is not the real one */
    extern int sched_setaffinity(int pid, unsigned int len, void *mask);

    int probe_setaffinity(void) {
        unsigned long mask[511];
        int rc;

        memset(mask, 0, sizeof(mask));
        mask[0] = 1;
        rc = sched_setaffinity(0, sizeof(mask), mask);
        if (rc >= 0) {
            /* Kernel truncates over-length masks -> successful call */
            return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_CPU_SET;
        } else if (errno == EINVAL) {
            /* Kernel returns EINVAL when len != sizeof(long) */
            return SCHED_SETAFFINITY_TAKES_3_ARGS_THIRD_IS_LONG;
        } else if (errno == EFAULT) {
            /* Kernel returns EFAULT having rejected len as an address */
            return SCHED_SETAFFINITY_TAKES_2_ARGS;
        }
        return SCHED_SETAFFINITY_UNKNOWN;
    }
Re: [O-MPI devel] Linux processor affinity
Jeff, et al., My own "research" into processor affinity for the GASNet runtime began by "borrowing" the related autoconf code from OpenMPI. My experience is the same as Jeff's when it comes to looking for a correlation between the API and any system parameter such as libc or kernel version: not an exhaustive search, but enough to see that there is no simple mapping. While far from "ideal", one option might be to perform an installation-time probe w/ a dumbed down version of the autoconf probes used at build time. This probe would then set the proper processor affinity setting in a config file, an env var in the ISV's wrapper around mpirun, or similar place. One can then have processor affinity disabled if no setting is found and use the one selected at install time if the setting is found. -Paul Jeff Squyres wrote: Greetings all. I'm writing this to ask for help from the general development community. We've run into a problem with Linux processor affinity, and although I've individually talked to a lot of people about this, no one has been able to come up with a solution. So I thought I'd open this to a wider audience. This is a long-ish e-mail; bear with me. As you may or may not know, Open MPI includes support for processor and memory affinity. There are a number of benefits, but I'll skip that discussion for now. For more information, see the following: http://www.open-mpi.org/faq/?category=building#build-paffinity http://www.open-mpi.org/faq/?category=building#build-maffinity http://www.open-mpi.org/faq/?category=tuning#paffinity-defs http://www.open-mpi.org/faq/?category=tuning#maffinity-defs http://www.open-mpi.org/faq/?category=tuning#using-paffinity Here's the problem: there are 3 different APIs for processor affinity in Linux. 
I have not done exhaustive research on this, but which API you have seems to depend on your version of kernel, glibc, and/or Linux vendor (i.e., some vendors appear to port different versions of the API to their particular kernel/glibc). The issue is that all 3 versions of the API use the same function names (sched_setaffinity() and sched_getaffinity()), but they change the number and types of the parameters to these functions. This is not a big problem for source distributions of Open MPI -- our configure script figures out which one you have and uses preprocessor directives to select the Right stuff in our code base for your platform. What *is* a big problem, however, is that ISVs can therefore not ship a binary Open MPI installation and reasonably expect the processor affinity aspects of it to work on multiple Linux platforms. That is, if the ISV compiles for API #X and ships a binary to a system that has API #Y, there are two options: 1. Processor affinity is disabled. This means that the benefits of processor affinity won't be visible (not hugely important on 2-way SMPs, but as the number of processors/cores increases, this is going to become more important), and Open MPI's NUMA-aware collectives won't be able to be used (because memory affinity may not be useful without processor affinity guarantees). 2. Processor affinity is enabled, but the code invokes API #X on a system with API #Y. This will have unpredictable results, the best case of which will be that processor affinity is simply [effectively] ignored; the worst case of which will be that the application will fail (e.g., seg fault). Clearly, neither of these solutions are attractive. My question to the developer crowd out there -- can you think of a way around this? More specifically, is there a way to know -- at run time -- which API to use? 
We can do some compiler trickery to compile all three APIs into a single Open MPI installation and then run-time dispatch to the Right one, but this is contingent upon being able to determine which API to dispatch to. A bunch of us have poked around and not found anything on the system that indicates which API you have (e.g., looked in /proc and /sys), but not found anything. Does anyone have any suggestions here? Many thanks for your time. -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group HPC Research Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
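The install-time idea in this message could look roughly like the sketch below at run time: read a setting that an installer wrote, and leave processor affinity disabled when nothing valid is found. Everything named here is hypothetical, including the setting values ("2arg", "3arg-long", "3arg-cpuset") and whatever environment variable the ISV's mpirun wrapper would export; neither Open MPI nor PLPA is known to define them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum affinity_setting {
    AFFINITY_DISABLED,      /* no usable setting: do not touch affinity */
    AFFINITY_2ARG,
    AFFINITY_3ARG_LONG,
    AFFINITY_3ARG_CPU_SET
};

/* Map the text written by the install-time probe to a setting.
 * Missing or unrecognised values disable affinity. */
enum affinity_setting parse_affinity_setting(const char *value)
{
    if (value == NULL) {
        return AFFINITY_DISABLED;
    }
    if (strcmp(value, "2arg") == 0) {
        return AFFINITY_2ARG;
    }
    if (strcmp(value, "3arg-long") == 0) {
        return AFFINITY_3ARG_LONG;
    }
    if (strcmp(value, "3arg-cpuset") == 0) {
        return AFFINITY_3ARG_CPU_SET;
    }
    return AFFINITY_DISABLED;
}

/* Look the setting up in the environment variable named var_name. */
enum affinity_setting affinity_setting_from_env(const char *var_name)
{
    assert(var_name != NULL);
    return parse_affinity_setting(getenv(var_name));
}
```

Falling back to "disabled" on any unexpected value keeps a stale or hand-edited setting from selecting the wrong API.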
[O-MPI devel] Linux processor affinity
Greetings all. I'm writing this to ask for help from the general development community. We've run into a problem with Linux processor affinity, and although I've individually talked to a lot of people about this, no one has been able to come up with a solution. So I thought I'd open this to a wider audience. This is a long-ish e-mail; bear with me. As you may or may not know, Open MPI includes support for processor and memory affinity. There are a number of benefits, but I'll skip that discussion for now. For more information, see the following: http://www.open-mpi.org/faq/?category=building#build-paffinity http://www.open-mpi.org/faq/?category=building#build-maffinity http://www.open-mpi.org/faq/?category=tuning#paffinity-defs http://www.open-mpi.org/faq/?category=tuning#maffinity-defs http://www.open-mpi.org/faq/?category=tuning#using-paffinity Here's the problem: there are 3 different APIs for processor affinity in Linux. I have not done exhaustive research on this, but which API you have seems to depend on your version of kernel, glibc, and/or Linux vendor (i.e., some vendors appear to port different versions of the API to their particular kernel/glibc). The issue is that all 3 versions of the API use the same function names (sched_setaffinity() and sched_getaffinity()), but they change the number and types of the parameters to these functions. This is not a big problem for source distributions of Open MPI -- our configure script figures out which one you have and uses preprocessor directives to select the Right stuff in our code base for your platform. What *is* a big problem, however, is that ISVs can therefore not ship a binary Open MPI installation and reasonably expect the processor affinity aspects of it to work on multiple Linux platforms. That is, if the ISV compiles for API #X and ships a binary to a system that has API #Y, there are two options: 1. Processor affinity is disabled. 
This means that the benefits of processor affinity won't be visible (not hugely important on 2-way SMPs, but increasingly important as the number of processors/cores grows), and Open MPI's NUMA-aware collectives can't be used (because memory affinity may not be useful without processor affinity guarantees).

2. Processor affinity is enabled, but the code invokes API #X on a system with API #Y. This will have unpredictable results: in the best case, processor affinity is simply [effectively] ignored; in the worst case, the application fails (e.g., seg fault).

Clearly, neither of these options is attractive. My question to the developer crowd out there -- can you think of a way around this? More specifically, is there a way to know -- at run time -- which API to use? We can do some compiler trickery to compile all three APIs into a single Open MPI installation and then run-time dispatch to the Right one, but this is contingent upon being able to determine which API to dispatch to. A bunch of us have poked around (e.g., looked in /proc and /sys) and not found anything on the system that indicates which API you have.

Does anyone have any suggestions here?

Many thanks for your time.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
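The run-time dispatch mentioned above might be structured along these lines once some probe has reported which API is present. This is a sketch under assumptions: the enum and function names are invented for illustration, and the three back ends are stand-ins that return a tag instead of calling the corresponding variant of sched_setaffinity().

```c
#include <assert.h>
#include <stddef.h>

/* The variants a run-time probe might report (illustrative names). */
enum affinity_api {
    API_TWO_ARGS,
    API_THREE_ARGS_LONG,
    API_THREE_ARGS_CPU_SET,
    API_UNKNOWN
};

typedef int (*bind_fn)(int cpu);

/* Stand-in back ends. Real ones would be compiled against each API
 * variant; here each returns a tag so the caller can see which one ran. */
static int bind_two_args(int cpu)           { assert(cpu >= 0); return 2; }
static int bind_three_args_long(int cpu)    { assert(cpu >= 0); return 31; }
static int bind_three_args_cpu_set(int cpu) { assert(cpu >= 0); return 32; }

/* Pick the back end for the probed API; NULL leaves affinity disabled. */
bind_fn select_bind(enum affinity_api api)
{
    switch (api) {
    case API_TWO_ARGS:           return bind_two_args;
    case API_THREE_ARGS_LONG:    return bind_three_args_long;
    case API_THREE_ARGS_CPU_SET: return bind_three_args_cpu_set;
    default:                     return NULL;
    }
}

/* Bind through the selected back end; -1 means affinity is disabled. */
int bind_to_cpu(enum affinity_api api, int cpu)
{
    bind_fn fn = select_bind(api);
    return fn != NULL ? fn(cpu) : -1;
}
```

The dispatch itself is the easy part; as the message says, the open question is how to obtain the enum value reliably at run time.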