[OMPI devel] RFC: make hwloc first-class data
WHAT: Make hwloc a 1st class item in OMPI

WHY: At least 2 pieces of new functionality want/need to use the hwloc data

WHERE: Put it in ompi/hwloc

WHEN: Some time in the 1.5 series

TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)

A long time ago, I floated the proposal of putting hwloc at the top level in opal so that parts of OPAL/ORTE/OMPI could use the data directly. I didn't have any concrete suggestions at the time about what exactly would use the hwloc data -- just a feeling that "someone" would want to. There are now two solid examples of functionality that want to use hwloc data directly:

1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right ones, but you get the idea). That is, pre-defined communicators that contain all the MPI procs on the same socket as you, the same NUMA node as you, the same core as you, ...etc.

2. INRIA presented a paper at Euro MPI last week that takes process distance to NICs into account when coming up with the long-message splitting ratio for the PML. E.g., if we have 2 openib NICs with the same bandwidth, don't just assume that we'll split long messages 50-50 across both of them. Instead, use NUMA distances to influence calculating the ratio. See the paper here: http://hal.archives-ouvertes.fr/inria-00486178/en/

A previous objection was that we are increasing our dependencies by making hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of business. Fair enough. But that being said, hwloc is getting a bit of a community growing around it: vendors are submitting patches for their hardware, distros are picking it up, etc. I certainly can't predict the future, but hwloc looks in good shape for now. There is a little risk in depending on hwloc, but I think it's small enough to be ok.

Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for embedded environments where hwloc simply takes up space and adds no value).

I previously proposed wrapping a subset of the hwloc API with opal_*() functions. After thinking about that a bit, that seems like a lot of work for little benefit -- how does one decide *which* subset of hwloc should be wrapped?

Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead of opal/hwloc). Indeed, the 2 places that want to use hwloc are up in the MPI layer -- I'm guessing that most functionality that wants hwloc will be up in MPI. And if we do the build system right, we can have paffinity/hwloc and libmpi's hwloc all link against the same libhwloc_embedded so that:

a) there's no duplication in the process, and
b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid having hwloc in OPAL/ORTE for embedded environments

(there's a little hand-waving there, but I think we can figure out the details)

We *may* want to refactor paffinity and maffinity someday, but that's not necessarily what I'm proposing here.

Comments?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
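[Editor's example] For a sense of the kind of query such MPI-layer functionality would make, here is a minimal sketch (not actual Open MPI code; written against the hwloc 1.0-era hwloc_cpuset_* API, error handling omitted) that walks the sockets and reports which PUs each one covers -- roughly the information an MPI_COMM_SOCKET-style communicator would need in order to group local ranks:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    int i, nsockets;
    char cpus[256];

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    nsockets = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_SOCKET);
    for (i = 0; i < nsockets; i++) {
        hwloc_obj_t socket = hwloc_get_obj_by_type(topology, HWLOC_OBJ_SOCKET, i);

        /* Render the socket's cpuset as a string; an MPI_COMM_SOCKET
         * implementation would instead group the local ranks whose
         * binding falls inside this cpuset. */
        hwloc_cpuset_snprintf(cpus, sizeof(cpus), socket->cpuset);
        printf("socket #%d: cpuset %s\n", i, cpus);
    }

    hwloc_topology_destroy(topology);
    return 0;
}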
[hwloc-devel] Create success (hwloc r1.1a1r2491)
Creating nightly hwloc snapshot SVN tarball was a success. Snapshot: hwloc 1.1a1r2491 Start time: Wed Sep 22 21:01:04 EDT 2010 End time: Wed Sep 22 21:03:14 EDT 2010 Your friendly daemon, Cyrador
Re: [OMPI devel] Setting AUTOMAKE_JOBS
On Sep 22, 2010, at 4:51 PM, Ralf Wildenhues wrote:

> Thanks for the measurements! I'm a bit surprised that the speedup is
> not higher. Do you have timings as to how much of the autogen.pl time
> is spent inside automake?

No, I didn't. I re-ran them to just time autoreconf (is there a way to extract *just* the time spent in automake in there?). Here's what I got:

$AUTOMAKE_JOBS    autoreconf wall time
1                 3:57.19
2                 2:43.82
4                 2:13.68
8                 2:13.47

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Setting AUTOMAKE_JOBS
Hi Jeff,

adding bug-automake in Cc: (non-subscribers can't post to the Open MPI list, so please remove that Cc: in case)

* Jeff Squyres wrote on Wed, Sep 22, 2010 at 03:50:19PM CEST:
> $AUTOMAKE_JOBS    Total wall time
> value             of autogen.pl
> 8                 3:01.46
> 4                 2:55.57
> 2                 3:28.09
> 1                 4:38.44
>
> This is an older Xeon machine with 2 sockets, each with 2 cores.

Thanks for the measurements! I'm a bit surprised that the speedup is not higher. Do you have timings as to how much of the autogen.pl time is spent inside automake?

IIRC the pure automake part for OpenMPI would speed up better on bigger systems; my old numbers from two years ago are here:
http://lists.gnu.org/archive/html/automake-patches/2008-10/msg00055.html

Cheers,
Ralf
Re: [OMPI devel] How to add a schedule algorithm to the pml
Thank you very much. Ken -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres Sent: Wednesday, September 22, 2010 10:09 AM To: Open MPI Developers Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml I see it here: http://hal.archives-ouvertes.fr/inria-00486178/en/ On Sep 22, 2010, at 11:53 AM, Kenneth Lloyd wrote: > Jeff, > > Is that EuroMPI2010 ob1 paper publicly available? I get involved in various > NUMA partitioning/architecting studies and it seems there is not a lot of > discussion in this area. > > Ken Lloyd > > == > Kenneth A. Lloyd > Watt Systems Technologies Inc. > > > > -Original Message- > From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: Wednesday, September 22, 2010 6:00 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml > > Sorry for the delay in replying -- I was in Europe for the past two weeks; > travel always makes me wy behind on my INBOX... > > > On Sep 14, 2010, at 9:56 PM, 张晶 wrote: > >> I tried to add a schedule algorithm to the pml component ,ob1 etc. Poorly I >> can only find a paper named "Open MPI: A Flexible High Performance MPI" >> and some annotation in the source file. From them , I know ob1 has >> implemented round-robin& weighted distribution algorithm. But after >> tracking the MPI_Send(),I cann't figure out >> the location of these implement ,let alone to add a new schedule algorithm. >> I have two questions : >> 1.The location of the schedule algorithm ? > > It's complicated -- I'd say that the PML is probably among the most > complicated sections of Open MPI because it is the main "engine" that > enforces the MPI point-to-point semantics. The algorithm is fairly well > distribute throughout the PML source code. :-\ > >> 2.There are five components :cm,crcpw ,csum ,ob1,V in the pml framework . >> The function of these components? > > cm: this component drives the MTL point-to-point components. It is mainly a > thin wrapper for network transports that provide their own MPI-like matching > semantics. Hence, most of the MPI semantics are effectively done in the > lower layer (i.e., in the MTL components and their dependent libraries). You > probably won't be able to do much here, because such transports (MX, Portals, > etc.) do most of their semantics in the network layer -- not in Open MPI. If > you have a matching network layer, this is the PML that you probably use (MX, > Portals, PSM). > > crcpw: this is a fork of the ob1 PML; it add some failover semantics. > > csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so > you can tell if the underlying transport had an error). > > v: this PML uses logging and replay to effect some level of fault tolerance. > It's a distant fork of the ob1 PML, but has quite a few significant > differences. > > ob1: this is the "main" PML that most users use (TCP, shared memory, > OpenFabrics, etc.). It gangs together one or more BTLs to send/receive > messages across individual network transports. Hence, it supports true > multi-device/multi-rail algorithms. The BML (BTL multiplexing layer) is a > thin management later that marshals all the BTLs in the process together -- > it's mainly array handling, etc. The ob1 PML is the one that decides > multi-rail/device splitting, etc. 
The INRIA folks just published a paper > last week at Euro MPI about adjusting the ob1 scheduling algorithm to also > take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth > calculations. > > Hope this helps! > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] How to add a schedule algorithm to the pml
I see it here: http://hal.archives-ouvertes.fr/inria-00486178/en/ On Sep 22, 2010, at 11:53 AM, Kenneth Lloyd wrote: > Jeff, > > Is that EuroMPI2010 ob1 paper publicly available? I get involved in various > NUMA partitioning/architecting studies and it seems there is not a lot of > discussion in this area. > > Ken Lloyd > > == > Kenneth A. Lloyd > Watt Systems Technologies Inc. > > > > -Original Message- > From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On > Behalf Of Jeff Squyres > Sent: Wednesday, September 22, 2010 6:00 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] How to add a schedule algorithm to the pml > > Sorry for the delay in replying -- I was in Europe for the past two weeks; > travel always makes me wy behind on my INBOX... > > > On Sep 14, 2010, at 9:56 PM, 张晶 wrote: > >> I tried to add a schedule algorithm to the pml component ,ob1 etc. Poorly I >> can only find a paper named "Open MPI: A Flexible High Performance MPI" >> and some annotation in the source file. From them , I know ob1 has >> implemented round-robin& weighted distribution algorithm. But after >> tracking the MPI_Send(),I cann't figure out >> the location of these implement ,let alone to add a new schedule algorithm. >> I have two questions : >> 1.The location of the schedule algorithm ? > > It's complicated -- I'd say that the PML is probably among the most > complicated sections of Open MPI because it is the main "engine" that > enforces the MPI point-to-point semantics. The algorithm is fairly well > distribute throughout the PML source code. :-\ > >> 2.There are five components :cm,crcpw ,csum ,ob1,V in the pml framework . >> The function of these components? > > cm: this component drives the MTL point-to-point components. It is mainly a > thin wrapper for network transports that provide their own MPI-like matching > semantics. Hence, most of the MPI semantics are effectively done in the > lower layer (i.e., in the MTL components and their dependent libraries). You > probably won't be able to do much here, because such transports (MX, Portals, > etc.) do most of their semantics in the network layer -- not in Open MPI. If > you have a matching network layer, this is the PML that you probably use (MX, > Portals, PSM). > > crcpw: this is a fork of the ob1 PML; it add some failover semantics. > > csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so > you can tell if the underlying transport had an error). > > v: this PML uses logging and replay to effect some level of fault tolerance. > It's a distant fork of the ob1 PML, but has quite a few significant > differences. > > ob1: this is the "main" PML that most users use (TCP, shared memory, > OpenFabrics, etc.). It gangs together one or more BTLs to send/receive > messages across individual network transports. Hence, it supports true > multi-device/multi-rail algorithms. The BML (BTL multiplexing layer) is a > thin management later that marshals all the BTLs in the process together -- > it's mainly array handling, etc. The ob1 PML is the one that decides > multi-rail/device splitting, etc. The INRIA folks just published a paper > last week at Euro MPI about adjusting the ob1 scheduling algorithm to also > take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth > calculations. > > Hope this helps! 
> > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [hwloc-devel] roadmap
Jeff Squyres, le Wed 22 Sep 2010 13:37:12 +0200, a écrit :
> I think we should support memory binding, even if it does weird things -- i.e., dropping membinding support on a given OS shouldn't be an option.

That's why I'd tend to keep set_cpubind and set_membind, warning that one may have an impact on the other, providing a flag for those who really care, and a binding guideline for normal users.

> And/or have an "atomic"-like function that sets the memory binding and returns the process memory binding?

I'm not sure I understand what this means.

> It would be good to put a sunset date or version on when hwloc_cpuset_foo will expire (e.g., 6 months from now or two major revisions from now [1.3] -- whichever comes last...?).

Ok.

> I'd also prefer a typedef than a #define for types (vs. a #define).

Sure.

Samuel
Re: [hwloc-devel] roadmap
Le 22/09/2010 16:30, Jeff Squyres a écrit : > On Sep 22, 2010, at 8:09 AM, Brice Goglin wrote: > > >> hwloc_set_*? hwloc_objset* ? Anything better? >> >> hwloc_set_* might not be the best since we would have a hwloc_set_set() >> function to set one bit :) >> > Agreed. Too bad, though -- I liked hwloc_set*. > > hwloc_group* (that seems kinda lame, though) > hwloc_stuff* (hah) > hwloc_bitmap* > > ? > bitmap or bitmask would be acceptable to me. >> By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be >> renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe? >> > How about just hwloc_set() which takes a single position parameter? > "onlyone" can be implied. > In case you missed it: cpu() = zero() + set() and all_but_cpu() = fill() + clr() Maybe just drop these? Brice
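[Editor's example] To spell out the equivalences Brice mentions, here is a tiny illustrative snippet using the current hwloc 1.0 cpuset calls (the very names under discussion for renaming); it is a sketch only, not a proposal for new API:

#include <hwloc.h>

static void cpuset_demo(void)
{
    hwloc_cpuset_t set = hwloc_cpuset_alloc();

    /* hwloc_cpuset_cpu(set, 3) is equivalent to: */
    hwloc_cpuset_zero(set);
    hwloc_cpuset_set(set, 3);

    /* hwloc_cpuset_all_but_cpu(set, 3) is equivalent to: */
    hwloc_cpuset_fill(set);
    hwloc_cpuset_clr(set, 3);

    hwloc_cpuset_free(set);
}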
Re: [hwloc-devel] roadmap
On Sep 22, 2010, at 8:09 AM, Brice Goglin wrote: > hwloc_set_*? hwloc_objset* ? Anything better? > > hwloc_set_* might not be the best since we would have a hwloc_set_set() > function to set one bit :) Agreed. Too bad, though -- I liked hwloc_set*. hwloc_group* (that seems kinda lame, though) hwloc_stuff* (hah) hwloc_bitmap* ? > By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be > renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe? How about just hwloc_set() which takes a single position parameter? "onlyone" can be implied. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [hwloc-devel] roadmap
Brice Goglin, le Wed 22 Sep 2010 10:38:38 +0200, a écrit :
> * Some OS bind the process too when you bind memory.

Not for all kinds of memory bindings. For now, nothing that has been committed does that; it's only in the remaining TODOs. The bindings in question are policy bindings, i.e. not binding some given area or explicitly allocating some given size.

> + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS

The length of the word tells me that won't be convenient :)

> so that the user can explicitly refuse memory binding if it may break process binding
> + Drop hwloc_set_membind on these OSes and add a hwloc_set_cpumembind() to bind both

That's the solution I prefer most, as it directly maps to existing OS practice.

> + Make both process and memory binding do nothing if the STRICT flag is given. But I'd rather not play too much with this flag.

Yes. We should not put too vague a semantic on this.

> + Drop support for memory binding on these OS.

Not all support, just setting the policy.

> + Drop these OS.

Nope :)

> * cpuset and nodeset structures are the same, they are both manipulated with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and hwloc_set_foo functions. With #define and aliases to not break API/ABIs.

I'd say so.

Samuel
Re: [OMPI devel] Question regarding recently common shared-memory component
On Sep 21, 2010, at 12:37 PM, wrote:

> Like I said in my earlier response, I have never tried this option. So I ran these tests on 1.4.2 now and apparently the behavior is the same, i.e., the checkpoint creation time increases when I enable the shared memory component.

I don't have huge experience with the checkpoint/restart stuff, but this is probably not a surprising result because the checkpoint will now need to include the shared memory stuff. Are the checkpoint images larger? (at least: is one of them noticeably larger?) That might account for the checkpoint performance difference.

> Is there any parameter that can be tuned to improve the performance?

My understanding is that there are some inherent bottlenecks in checkpoint / restart, such as the time required to dump out all the process images to disk.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] New Romio for OpenMPI available in bitbucket
I just committed the very last modifications of ROMIO (mpich2-1.3rc1) into bitbucket.

Pascal

Jeff Squyres a écrit :
> On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:
>> In charge of ticket 1888 (see at https://svn.open-mpi.org/trac/ompi/ticket/1888), I have put the resulting code in bitbucket at: http://bitbucket.org/devezep/new-romio-for-openmpi/
>
> Sweet!
>
>> The work in this repo consisted in refreshing ROMIO to a newer version: the one from the very last MPICH2 release (mpich2-1.3b1).
>
> Great! I saw there was another MPICH2 release, and I saw a ROMIO patch or three go by on the MPICH list recently. Do you expect there to be major differences between what you have and those changes?
>
> I don't have any parallel filesystems to test with, but if someone else in the community could confirm/verify at least one or two of the parallel filesystems supported in ROMIO, I think we should bring this stuff into the trunk soon.
>
>> Testing:
>> 1. runs fine except one minor error (see the explanation below) on various FS.
>> 2. runs fine with Lustre, but:
>>    . had to add a small patch in romio/adio/ad_lustre_open.c
>
> Did this patch get pushed upstream?
>
>> The minor error
>> ===
>> The test error.c fails because OpenMPI does not handle correctly the "two level" error functions of ROMIO:
>> error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_ARG, "**iobaddisp", 0);
>> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".
>
> Do you mean that we should be returning an error string "**iobaddisp" instead of "MPI_ERR_ARG"?
[OMPI devel] Setting AUTOMAKE_JOBS
Some of you may be unaware that recent versions of automake can run in parallel. That is, automake will run in parallel with a degree of (at most) $AUTOMAKE_JOBS. This can speed up the execution time of autogen.pl quite a bit on some platforms. On my cluster at Cisco, here's a few quick timings of the entire autogen.pl process (of which, automake is the bottleneck):

$AUTOMAKE_JOBS    Total wall time
value             of autogen.pl
8                 3:01.46
4                 2:55.57
2                 3:28.09
1                 4:38.44

This is an older Xeon machine with 2 sockets, each with 2 cores. There's a nice performance jump from 1 to 2, and a smaller jump from 2 to 4. 4 and 8 are close enough to not matter. YMMV.

I just committed a heuristic to autogen.pl to setenv AUTOMAKE_JOBS if it is not already set (https://svn.open-mpi.org/trac/ompi/changeset/23788):

- If lstopo is found in your $PATH, it runs it and counts how many PUs (processing units) you have. It'll set AUTOMAKE_JOBS to that number, or a maximum of 4 (which is admittedly a further heuristic).

- If lstopo is not found, it just sets AUTOMAKE_JOBS to 2.

Enjoy.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Barrier() after Finalize() when a file handle is leaked.
Thanks Lisandro! I filed https://svn.open-mpi.org/trac/ompi/ticket/2594 about this.

On Sep 15, 2010, at 11:28 AM, Lisandro Dalcin wrote:

> I've tested this with (--enable-debug --enable-picky --enable-mem-debug) 1.4.2 and 1.5rc6. Despite being debug builds, a mpi4py user got the same with (likely release) builds in both Ubuntu and OS X.
>
> $ cat open.c
> #include <mpi.h>
> int main(int argc, char *argv[]) {
>   MPI_File f;
>   MPI_Init(&argc, &argv);
>   MPI_File_open(MPI_COMM_WORLD, "test.plt", MPI_MODE_RDONLY, MPI_INFO_NULL, &f);
>   /* MPI_File_close(&f); */
>   MPI_Finalize();
>   return 0;
> }
>
> $ mpicc open.c
>
> $ ./a.out
> *** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
> *** This is disallowed by the MPI standard.
> *** Your MPI job will now abort.
> [trantor:15145] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!
>
> So if you open a file but never close it, an MPI_Barrier() gets called after MPI_Finalize(). Could that come from a finalizer ROMIO callback?
> However, I do not get this failure with MPICH2, and Open MPI seems to behave just fine regarding MPI_Finalized(); the code below works as expected:
>
> #include <stdio.h>
> #include <mpi.h>
>
> static int atexitmpi(MPI_Comm comm, int k, void *v, void *xs) {
>   int flag;
>   MPI_Finalized(&flag);
>   printf("atexitmpi: finalized=%d\n", flag);
>   MPI_Barrier(MPI_COMM_WORLD);
>   return MPI_SUCCESS;
> }
>
> int main(int argc, char *argv[]) {
>   int keyval = MPI_KEYVAL_INVALID;
>   MPI_Init(&argc, &argv);
>   MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, atexitmpi, &keyval, 0);
>   MPI_Comm_set_attr(MPI_COMM_SELF, keyval, 0);
>   MPI_Finalize();
>   return 0;
> }
>
> --
> Lisandro Dalcin
> ---
> CIMEC (INTEC/CONICET-UNL)
> Predio CONICET-Santa Fe
> Colectora RN 168 Km 472, Paraje El Pozo
> Tel: +54-342-4511594 (ext 1011)
> Tel/Fax: +54-342-4511169
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
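[Editor's example] As an aside on the finalizer-callback theory above: a cleanup hook that may run very late can protect itself with MPI_Finalized(). A minimal hedged sketch (not ROMIO or Open MPI code) of that kind of guard:

#include <mpi.h>

/* Hedged sketch: skip MPI calls entirely once MPI_Finalize() has completed. */
static void late_cleanup(void)
{
    int finalized = 0;

    MPI_Finalized(&finalized);
    if (!finalized) {
        /* Still safe to make MPI calls, e.g. a final synchronization. */
        MPI_Barrier(MPI_COMM_WORLD);
    }
}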
Re: [OMPI devel] New Romio for OpenMPI available in bitbucket
Jeff Squyres a écrit :
> On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote:
>> In charge of ticket 1888 (see at https://svn.open-mpi.org/trac/ompi/ticket/1888), I have put the resulting code in bitbucket at: http://bitbucket.org/devezep/new-romio-for-openmpi/
>
> Sweet!
>
>> The work in this repo consisted in refreshing ROMIO to a newer version: the one from the very last MPICH2 release (mpich2-1.3b1).
>
> Great! I saw there was another MPICH2 release, and I saw a ROMIO patch or three go by on the MPICH list recently. Do you expect there to be major differences between what you have and those changes?

I also see this new release (mpich2-1.3rc1). I am going to port the modifications and inform the list.

> I don't have any parallel filesystems to test with, but if someone else in the community could confirm/verify at least one or two of the parallel filesystems supported in ROMIO, I think we should bring this stuff into the trunk soon.
>
>> Testing:
>> 1. runs fine except one minor error (see the explanation below) on various FS.
>> 2. runs fine with Lustre, but:
>>    . had to add a small patch in romio/adio/ad_lustre_open.c
>
> Did this patch get pushed upstream?

This patch has been integrated yesterday in mpich2-1.3rc1, together with another patch in romio/adio/common/lock.c. They will be available very soon in bitbucket.

>> The minor error
>> ===
>> The test error.c fails because OpenMPI does not handle correctly the "two level" error functions of ROMIO:
>> error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, myname, __LINE__, MPI_ERR_ARG, "**iobaddisp", 0);
>> OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp".
>
> Do you mean that we should be returning an error string "**iobaddisp" instead of "MPI_ERR_ARG"?

In MPICH2, they have a file mpi/errhan/errnames.txt that is used to generate mpi/errhan/errnames.h, making the link between codes like "**iobaddisp" and the corresponding error string "Invalid displacement argument". The error.c program tests for the presence of "displacement" in the error string.

With OpenMPI, the error message is: "MPI_ERR_ARG: invalid argument of some other kind"

With MPICH2, the error message is: "Invalid argument, error stack: MPI_FILE_SET_VIEW(60): Invalid displacement argument"

It would be better if OpenMPI displayed at least the "Invalid displacement argument" message. This is not a new problem in OpenMPI; it was also the case in the trunk.
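[Editor's example] For reference, the check in ROMIO's error.c essentially boils down to looking for the word "displacement" in the string returned for the error code. A rough sketch of that style of check -- assuming an already-opened file handle fh, and not the actual test source -- looks like this:

#include <stdio.h>
#include <string.h>
#include <mpi.h>

/* Trigger the "**iobaddisp" case with an invalid displacement and see
 * whether the error string mentions it.  Files default to
 * MPI_ERRORS_RETURN, so the error code comes back to the caller. */
static void check_baddisp_error(MPI_File fh)
{
    char msg[MPI_MAX_ERROR_STRING];
    int err, len;

    err = MPI_File_set_view(fh, (MPI_Offset) -1, MPI_BYTE, MPI_BYTE,
                            "native", MPI_INFO_NULL);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        if (strstr(msg, "displacement") == NULL)
            printf("error string does not mention the bad displacement: %s\n", msg);
    }
}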
Re: [hwloc-devel] roadmap
Le 22/09/2010 13:36, Jeff Squyres a écrit : > On Sep 22, 2010, at 4:38 AM, Brice Goglin wrote: > > >> There are still some problems to solve in the membind branch: >> * Some OS bind the process too when you bind memory. I see the following >> solutions: >> + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS so that >> the user can explicitly refuse memory binding if it may break process >> binding >> + Drop hwloc_set_membind on these OSes and add a >> hwloc_set_cpumembind() to bind both >> + Make both process and memory binding do nothing if the STRICT flag >> is given. But I'd rather not play too much with this flag. >> + Drop support for memory binding on these OS. >> + Drop these OS. >> > What OS's are you specifically referring to? > IIRC, it was AIX and Solaris. > How about adding a query function that says what will happen for > hwloc_set_membind() I like it, we can put this in the output of hwloc_topology_get_support. I wonder if there are some other cases where the STRICT flag could be dropped in favor of such an informative stuff. > Just curious -- on these OS's, what happens if you: > > - bind proc to A > - bind memory to B (which then also re-binds proc to B) > - re-bind proc to A > > Is the memory binding then lost? > I'll let Samuel comment on this. >> * cpuset and nodeset structures are the same, they are both manipulated >> with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and >> hwloc_set_foo functions. With #define and aliases to not break API/ABIs. >> > I'm in favor of this -- it would end the overloading of the term "cpuset" > between hwloc and cpuset. > hwloc_set_*? hwloc_objset* ? Anything better? hwloc_set_* might not be the best since we would have a hwloc_set_set() function to set one bit :) By the way, hwloc_cpuset_cpu() and hwloc_cpuset_all_but_cpu() should be renamed too. hwloc_set_onlyone() and hwloc_set_allbutone() maybe? Brice
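[Editor's example] For context, here is a minimal sketch of how an application already consults hwloc_topology_get_support() in the hwloc 1.0-era API to learn whether CPU binding will work on the current OS; the idea above is to report memory-binding side effects through the same structure. Only the existing cpubind field is shown -- anything beyond that is an assumption, so treat this as illustrative:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    const struct hwloc_topology_support *support;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Query what the OS backend can actually do before trying to bind. */
    support = hwloc_topology_get_support(topo);
    printf("can bind this process's CPUs: %d\n",
           (int) support->cpubind->set_thisproc_cpubind);

    hwloc_topology_destroy(topo);
    return 0;
}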
Re: [OMPI devel] How to add a schedule algorithm to the pml
Sorry for the delay in replying -- I was in Europe for the past two weeks; travel always makes me way behind on my INBOX...

On Sep 14, 2010, at 9:56 PM, 张晶 wrote:

> I tried to add a schedule algorithm to the pml components (ob1, etc.). Unfortunately, I can only find a paper named "Open MPI: A Flexible High Performance MPI" and some annotations in the source files. From them, I know ob1 has implemented round-robin and weighted distribution algorithms. But after tracking MPI_Send(), I can't figure out where these are implemented, let alone how to add a new schedule algorithm. I have two questions:
> 1. Where is the schedule algorithm located?

It's complicated -- I'd say that the PML is probably among the most complicated sections of Open MPI because it is the main "engine" that enforces the MPI point-to-point semantics. The algorithm is fairly well distributed throughout the PML source code. :-\

> 2. There are five components (cm, crcpw, csum, ob1, v) in the pml framework. What are the functions of these components?

cm: this component drives the MTL point-to-point components. It is mainly a thin wrapper for network transports that provide their own MPI-like matching semantics. Hence, most of the MPI semantics are effectively done in the lower layer (i.e., in the MTL components and their dependent libraries). You probably won't be able to do much here, because such transports (MX, Portals, etc.) do most of their semantics in the network layer -- not in Open MPI. If you have a matching network layer, this is the PML that you probably use (MX, Portals, PSM).

crcpw: this is a fork of the ob1 PML; it adds some failover semantics.

csum: this is also a fork of the ob1 PML; it adds checksumming semantics (so you can tell if the underlying transport had an error).

v: this PML uses logging and replay to effect some level of fault tolerance. It's a distant fork of the ob1 PML, but has quite a few significant differences.

ob1: this is the "main" PML that most users use (TCP, shared memory, OpenFabrics, etc.). It gangs together one or more BTLs to send/receive messages across individual network transports. Hence, it supports true multi-device/multi-rail algorithms. The BML (BTL multiplexing layer) is a thin management layer that marshals all the BTLs in the process together -- it's mainly array handling, etc. The ob1 PML is the one that decides multi-rail/device splitting, etc. The INRIA folks just published a paper last week at Euro MPI about adjusting the ob1 scheduling algorithm to also take NUMA/NUNA/NUIOA effects into account, not just raw bandwidth calculations.

Hope this helps!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
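[Editor's example] To make the "weighted distribution" idea concrete, here is a purely illustrative sketch -- not the actual ob1 source; the type and function names are invented for the example -- of splitting a long message across rails in proportion to their advertised bandwidth, i.e. the kind of ratio that the INRIA work adjusts with NUMA distance information:

#include <stddef.h>

typedef struct {
    double bandwidth;   /* relative bandwidth of this rail/BTL */
    size_t bytes;       /* bytes assigned to this rail for the message */
} rail_info_t;

/* Split msg_len bytes across the rails proportionally to bandwidth. */
static void split_message(rail_info_t *rails, int nrails, size_t msg_len)
{
    double total_bw = 0.0;
    size_t assigned = 0;
    int i;

    for (i = 0; i < nrails; i++)
        total_bw += rails[i].bandwidth;

    for (i = 0; i < nrails; i++) {
        rails[i].bytes = (size_t) (msg_len * (rails[i].bandwidth / total_bw));
        assigned += rails[i].bytes;
    }

    /* Give any rounding leftover to the first rail. */
    if (nrails > 0)
        rails[0].bytes += msg_len - assigned;
}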
Re: [OMPI devel] New Romio for OpenMPI available in bitbucket
On Sep 17, 2010, at 6:36 AM, Pascal Deveze wrote: > In charge of ticket 1888 (see at > https://svn.open-mpi.org/trac/ompi/ticket/1888) , > I have put the resulting code in bitbucket at: > http://bitbucket.org/devezep/new-romio-for-openmpi/ Sweet! > The work in this repo consisted in refreshing ROMIO to a newer > version: the one from the very last MPICH2 release (mpich2-1.3b1). Great! I saw there was another MPICH2 release, and I saw a ROMIO patch or three go by on the MPICH list recently. Do you expect there to be major differences between what you have and those changes? I don't have any parallel filesystems to test with, but if someone else in the community could confirm/verify at least one or two of the parallel filesystems supported in ROMIO, I think we should bring this stuff into the trunk soon. > Testing: > 1. runs fine except one minor error (see the explanation below) on various FS. > 2. runs fine with Lustre, but: >. had to add a small patch in romio/adio/ad_lustre_open.c Did this patch get pushed upstream? > The minor error === > The test error.c fails because OpenMPI does not handle correctly the > "two level" error functions of ROMIO: > error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, > myname, __LINE__, MPI_ERR_ARG, > "**iobaddisp", 0); > OpenMPI limits its view to MPI_ERR_ARG, but the real error is "**iobaddisp". Do you mean that we should be returning an error string "**iobaddisp" instead of "MPI_ERR_ARG"? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[hwloc-devel] roadmap
Hello, hwloc 1.0 was released in May. I think we should release 1.1 before SC10, which means doing a first RC within a couple weeks. trunk got many changes since 1.0, but nothing very important. trac says we're missing memory binding, distances and user-defined process restrictions. Memory binding is the most important one, it was supposed to be in 1.0. I think we shouldn't defer 1.1 because of the others. There are still some problems to solve in the membind branch: * Some OS bind the process too when you bind memory. I see the following solutions: + Add a flag such as HWLOC_MEMBIND_EVEN_IF_FAR_FROM_PROCESS so that the user can explicitly refuse memory binding if it may break process binding + Drop hwloc_set_membind on these OSes and add a hwloc_set_cpumembind() to bind both + Make both process and memory binding do nothing if the STRICT flag is given. But I'd rather not play too much with this flag. + Drop support for memory binding on these OS. + Drop these OS. * cpuset and nodeset structures are the same, they are both manipulated with hwloc_cpuset_foo functions. So maybe rename into hwloc_set_t and hwloc_set_foo functions. With #define and aliases to not break API/ABIs. Opinions ? Brice
Re: [hwloc-devel] hwloc powerpc rhel5 and power7 patch
On 21/09/10 19:34, Samuel Thibault wrote:
> Just a last question: is it ok to include the /proc and /sys trees you have posted in the hwloc testcases?

That's ok.