Re: [OMPI devel] Process placement
Ralph, I still observe these issues in the current master. (npernode is not respected either). Also note that the display_allocation seems to be wrong (slots_inuse=0 when the slot is obviously in use). $ git show 4899c89 (HEAD -> master, origin/master, origin/HEAD) Fix a race condition when multiple threads try to create a bml enBouteiller 6 hours ago $ bin/mpirun -np 12 -hostfile /opt/etc/ib10g.machinefile.ompi -display-allocation -map-by nodehostname == ALLOCATED NODES == dancer00: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer01: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer02: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer03: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer04: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer05: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer06: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer07: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer08: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer09: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer10: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer11: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer12: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer13: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer14: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN dancer15: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN = dancer01 dancer00 dancer01 dancer01 dancer01 dancer00 dancer00 dancer00 dancer00 dancer00 dancer00 dancer00 -- Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ <https://icl.cs.utk.edu/~bouteill/> > Le 13 avr. 2016 à 13:38, Ralph Castain <r...@open-mpi.org> a écrit : > > The —map-by node option should now be fixed on master, and PRs waiting for > 1.10 and 2.0 > > Thx! > >> On Apr 12, 2016, at 6:45 PM, Ralph Castain <r...@open-mpi.org >> <mailto:r...@open-mpi.org>> wrote: >> >> FWIW: speaking just to the —map-by node issue, Josh Ladd reported the >> problem on master as well yesterday. I’ll be looking into it on Wed. >> >>> On Apr 12, 2016, at 5:53 PM, George Bosilca <bosi...@icl.utk.edu >>> <mailto:bosi...@icl.utk.edu>> wrote: >>> >>> >>> >>> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet <gil...@rist.or.jp >>> <mailto:gil...@rist.or.jp>> wrote: >>> George, >>> >>> about the process binding part >>> >>> On 4/13/2016 7:32 AM, George Bosilca wrote: >>> Also my processes, despite the fact that I asked for 1 per node, are not >>> bound to the first core. Shouldn’t we release the process binding when we >>> know there is a single process per node (as in the above case) ? >>> did you expect the tasks are bound to the first *core* on each node ? >>> >>> i would expect the tasks are bound to the first *socket* on each node. >>> >>> In this particular instance, where it has been explicitly requested to have >>> a single process per node, I would have expected the process to be unbound >>> (we know there is only one per node). It is the responsibility of the >>> application to bound itself or its thread if necessary. Why are we >>> enforcing a particular binding policy? >>> >>> (since we do not know how many (OpenMP or other) threads will be used by >>> the application, >>> --bind-to socket is a good policy imho. in this case (one task per node), >>> no binding at all would mean >>> the task can migrate from one socket to the other, and/or OpenMP threads >>> are bound accross sockets. >>> That would trigger some NUMA effects (better bandwidth if memory is locally >>> accessed, but worst performance >>> is memory is allocated only on one socket). >>> so imho, --bind-to socket is still my preferred policy, even if there is >>> only one MPI task per node. >>> >>> Open MPI is about MPI ranks/processes. I don't think it is our job to try >>> to figure out how the user handle do with it's own threads. >>> >>> Your justification make sense if the application only uses a single socket. >>> It also make sense if one starts multiple ranks per node, and the internal >>> thre
Re: [OMPI devel] Confusion about slots
To add to what Ralf said, you probably do not want to use Hyper Threads for HPC workloads, as that generally results in very poor performance (as you noticed). Set the number of slots to the number of real cores (not HT), that would yield optimal results 95% of the time. Aurélien -- Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ <https://icl.cs.utk.edu/~bouteill/> > Le 23 mars 2016 à 16:24, Ralph Castain <r...@open-mpi.org> a écrit : > > “Slots” are an abstraction commonly used by schedulers as a way of indicating > how many processes are allowed to run on a given node. It has nothing to do > with hardware, either cores or HTs. > > MPI programmers frequently like to bind a process to one or more hardware > assets (cores or HTs). Thus, you will see confusion in the community where > people mix the term “slot” with “cores” or “cpus”. This is unfortunate as it > the terms really do mean very different things. > > In OMPI, we chose to try and “help” the user by not requiring them to specify > detailed info in a hostfile. So if you don’t specify the number of “slots” > for a given node, we will sense the number of cores on that node and set the > slots to match that number. This best matches user expectations today. > > If you do specify the number of slots, then we use that to guide the desired > number of processes assigned to each node. We then bind each of those > processes according to the user-provided guidance. > > HTH > Ralph > >> On Mar 23, 2016, at 9:35 AM, Federico Reghenzani >> <federico1.reghenz...@mail.polimi.it >> <mailto:federico1.reghenz...@mail.polimi.it>> wrote: >> >> Ok, I've investigated further today, it seems "--map-by hwthread" does not >> remove the problem. However, if I specified in the hostfile "node0 slots=32" >> it runs really slower than specifying only "node0". In both cases I run >> mpirun with -np 32. So I'm quite sure I didn't understand what slots are. >> >> __ >> Federico Reghenzani >> M.Eng. Student @ Politecnico di Milano >> Computer Science and Engineering >> >> >> >> 2016-03-22 18:56 GMT+01:00 Federico Reghenzani >> <federico1.reghenz...@mail.polimi.it >> <mailto:federico1.reghenz...@mail.polimi.it>>: >> Hi guys, >> >> I'm really confused about slots in resource allocation: I thought that slots >> are the number of processes spawnable in a certain node, so it should >> correspond to the number of Processing Elements of the node. For example, on >> each of my nodes I have 2 processors, total 16 cores with hyperthreading, so >> a total of 32 processing elements per node (i.e. 32 hw threads). However, >> considering a single node, passing in the hostfile 32 slots and requesting >> "-np 32" results is a performance degradation of 20x slower than using only >> "-np 16". The problem disappears specifing --map-by hwthread. >> >> Investigating on the problem I found these counterintuitive things: >> - here >> <https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles> is >> stated "slots are Open MPI's representation of how many processors are >> available" >> - here <https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php#sect6> is >> stated "Slots indicate how many processes can potentially execute on a node. >> For best performance, the number of slots may be chosen to be the number of >> cores on the node or the number of processor sockets" >> - I tried to remove the slots information from the hostfile, so according to >> this >> <https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles> >> should be interpreted as "1", but it spawns anyway 32 processes >> - I'm not sure what --map-by and --rank-by do >> >> In custom RAS we are developing, what we have to send to mpirun? The number >> of processor sockets, the number of cores or the number of hwthread >> available? How --map-by and --rank-by affect the spawn policy? >> >> >> Thank you! >> >> >> OFFTOPIC: is someone going to EuroMPI 2016 in September? We will be there to >> present our migration technique. >> >> >> Cheers, >> Federico >> >> __ >> Federico Reghenzani >> M.Eng. Student @ Politecnico di Milano >> Computer Science and Engineering >> >> >> >> ___ >> devel mailing list >> de...@open-mpi.org <mailto:de...@open-mpi.org> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >> <http://www.open-mpi.org/mailman/listinfo.cgi/devel> >> Link to this post: >> http://www.open-mpi.org/community/lists/devel/2016/03/18723.php >> <http://www.open-mpi.org/community/lists/devel/2016/03/18723.php> > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2016/03/18724.php smime.p7s Description: S/MIME cryptographic signature
[OMPI devel] use-mpi mpiext?
I am making an MPI extension in latest master. I have a problem with the use-mpi part of the extension: Makefile.am contains the following 13 headers = \ 14 >...mpiext_blabla_usempi.h 15 16 noinst_HEADERS = \ 17 $(headers) For some reason, the build system tries to compile a .a for the usempi extension. My understanding is that it should use the same bindings as the mpifh.a extension (which builds successfully). make[1]: Leaving directory `/home/bouteill/ompi/debug.build/ompi/mpi/fortran/mpif-h' Making install in mpi/fortran/use-mpi-ignore-tkr make[1]: Entering directory `/home/bouteill/ompi/debug.build/ompi/mpi/fortran/use-mpi-ignore-tkr' FCLD libmpi_usempi_ignore_tkr.la libtool: link: cannot find the library `../../../../ompi/mpiext/blabla/use-mpi/libmpiext_blabla_usempi.la' or unhandled argument `../../../../ompi/mpiext/blabla/use-mpi/libmpiext_blabla_usempi.la' make[1]: *** [libmpi_usempi_ignore_tkr.la] Error 1 -- Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ smime.p7s Description: S/MIME cryptographic signature
Re: [OMPI devel] Remote orted verbosity
Frederico, Just add -debug-daemons to the mpirun command options. Aurélien -- Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ <https://icl.cs.utk.edu/~bouteill/> > Le 23 nov. 2015 à 08:55, Federico Reghenzani > <federico1.reghenz...@mail.polimi.it> a écrit : > > Hi! > > Is there any way to get the output of OPAL_OUTPUT_VERBOSE on remote orteds? > (or write it to a local file?). > > We tried with --mca orte_debug_verbose but it works only for the local > machine (= where mpirun is executed). > > > Cheers, > Federico > > __ > Federico Reghenzani > M.Eng. Student @ Politecnico di Milano > Computer Science and Engineering > > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2015/11/18383.php
[OMPI devel] smcuda higher exclusivity than anything else?
I was making basic performance measurements on our machine after installing 1.8.5, the performance were looking bad. It turns out that the smcuda btl has a higher exclusivity than both vader and sm, even on machines with no nvidia adapters. Is there a strong reason why the default exclusivity is set so high ? Of course it can be easily fixed with a couple of mca options, but unsuspecting users that “just run” will experience 1/3 overhead across the board for shared memory communication according to my measurements. Side note: from my understanding of the smcuda component, performance should be identical to the regular sm component (as long as no GPU operation are required). This is not the case, there is some performance penalty with smcuda compared to sm. Aurelien -- Aurélien Bouteiller ~~ https://icl.cs.utk.edu/~bouteill/ signature.asc Description: Message signed with OpenPGP using GPGMail
[OMPI devel] 1.8.5rc1 and OOB on Cray XC30
a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements). -- [aprun6-darter:16915] [[54804,0],0] TCP SHUTDOWN [aprun6-darter:16915] mca: base: close: component tcp closed [aprun6-darter:16915] mca: base: close: unloading component tcp -- Aurélien Bouteiller ~ https://icl.cs.utk.edu/~bouteill/ signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [OMPI devel] RFC: "v1.9.0" (vs. "v1.9")
During the phase where there is not yet a release of “next”, the README and other documentations employs the number of the not yet released upcoming version. Sometimes when these gets dispatched, outsiders get confused that they are using some release version, when in fact they are running a nightly build. Reserving a particular number (like 1.9.0) for all non-release versions of a general series could help avoid this. -- ~~~ Aurélien Bouteiller, Ph.D. ~~~ ~ Research Scientist @ ICL ~ The University of Tennessee, Innovative Computing Laboratory 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996 tel: +1 (865) 974-9375 fax: +1 (865) 974-8296 https://icl.cs.utk.edu/~bouteill/ Le 22 sept. 2014 à 14:21, Ralph Castain <r...@open-mpi.org> a écrit : > Not sure I understand - what do you mean by a "free" number?? > > On Sep 22, 2014, at 10:50 AM, Aurélien Bouteiller <boute...@icl.utk.edu> > wrote: > >> Could also start at 1.9.1 instead of 1.9.0. That gives a free number for the >> “trunk” nightly builds. >> >> >> -- >> ~~~ Aurélien Bouteiller, Ph.D. ~~~ >>~ Research Scientist @ ICL ~ >> The University of Tennessee, Innovative Computing Laboratory >> 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996 >> tel: +1 (865) 974-9375 fax: +1 (865) 974-8296 >> https://icl.cs.utk.edu/~bouteill/ >> >> >> >> >> Le 22 sept. 2014 à 13:38, Jeff Squyres (jsquyres) <jsquy...@cisco.com> a >> écrit : >> >>> WHAT: Change our version numbering scheme to always include all 3 numbers >>> -- even when the 3rd number is 0. >>> >>> WHY: I think we made a mistake years ago when we designed the version >>> number scheme. It's weird that we drop the last digit when it is 0. >>> >>> WHERE: Trivial patch. See below. >>> >>> WHEN: Tuesday teleconf next week, 30 Sep 2014 >>> >>> MORE DETAIL: >>> >>> Right now, per http://www.open-mpi.org/software/ompi/versions/, when the >>> 3rd digit of our version number is zero, we drop it in the filename and >>> various other outputs (e.g., ompi_info). For example, we have: >>> >>> openmpi-1.8.tar.bz2 >>> instead of openmpi-1.8.0.tar.bz2 >>> >>> Honestly, I think that's just a little weird. I think I was the one who >>> advocated for dropping the 0 way back in the beginning, but I'm now >>> changing my mind. :-) >>> >>> Making this change will be immediately obvious in the filename of the trunk >>> nightly tarball. It won't affect the v1.8 series (or any prior series), >>> because they're all well past their .0 releases. But it will mean that the >>> first release in the v1.9 series will be "v1.9.0". >>> >>> Finally, note that this will also apply to all version numbers shown in >>> ompi_info (e.g., components and projects). >>> >>> Here's the diff: >>> >>> Index: config/opal_get_version.m4 >>> === >>> --- config/opal_get_version.m4 (revision 32771) >>> +++ config/opal_get_version.m4 (working copy) >>> @@ -60,12 +60,7 @@ >>> p" < "$1"` >>> [eval] "$opal_vers" >>> >>> -# Only print release version if it isn't 0 >>> -if test $$2_RELEASE_VERSION -ne 0 ; then >>> - >>> $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION" >>> -else >>> -$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION" >>> -fi >>> + >>> $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION" >>> $2_VERSION="${$2_VERSION}${$2_GREEK_VERSION}" >>> $2_BASE_VERSION=$$2_VERSION >>> >>> Index: opal/runtime/opal_info_support.c >>> === >>> --- opal/runtime/opal_info_support.c(revision 32771) >>> +++ opal/runtime/opal_info_support.c(working copy) >>> @@ -1099,14 +1099,8 @@ >>> temp[BUFSIZ - 1] = '\0'; >>> if (0 == strcmp(scope, opal_info_ver_full) || >>> 0 == strcmp(scope, opal_info_ver_all)) { >>> -snprintf(temp, BUFSIZ - 1, "%d.%d", major, minor); >>> +snprintf(temp, BUFSIZ - 1, "%d.%d.%d", major, minor, release); >>> str = strdup(temp)
Re: [OMPI devel] RFC: "v1.9.0" (vs. "v1.9")
Could also start at 1.9.1 instead of 1.9.0. That gives a free number for the “trunk” nightly builds. -- ~~~ Aurélien Bouteiller, Ph.D. ~~~ ~ Research Scientist @ ICL ~ The University of Tennessee, Innovative Computing Laboratory 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996 tel: +1 (865) 974-9375 fax: +1 (865) 974-8296 https://icl.cs.utk.edu/~bouteill/ Le 22 sept. 2014 à 13:38, Jeff Squyres (jsquyres) <jsquy...@cisco.com> a écrit : > WHAT: Change our version numbering scheme to always include all 3 numbers -- > even when the 3rd number is 0. > > WHY: I think we made a mistake years ago when we designed the version number > scheme. It's weird that we drop the last digit when it is 0. > > WHERE: Trivial patch. See below. > > WHEN: Tuesday teleconf next week, 30 Sep 2014 > > MORE DETAIL: > > Right now, per http://www.open-mpi.org/software/ompi/versions/, when the 3rd > digit of our version number is zero, we drop it in the filename and various > other outputs (e.g., ompi_info). For example, we have: > > openmpi-1.8.tar.bz2 > instead of openmpi-1.8.0.tar.bz2 > > Honestly, I think that's just a little weird. I think I was the one who > advocated for dropping the 0 way back in the beginning, but I'm now changing > my mind. :-) > > Making this change will be immediately obvious in the filename of the trunk > nightly tarball. It won't affect the v1.8 series (or any prior series), > because they're all well past their .0 releases. But it will mean that the > first release in the v1.9 series will be "v1.9.0". > > Finally, note that this will also apply to all version numbers shown in > ompi_info (e.g., components and projects). > > Here's the diff: > > Index: config/opal_get_version.m4 > === > --- config/opal_get_version.m4(revision 32771) > +++ config/opal_get_version.m4(working copy) > @@ -60,12 +60,7 @@ > p" < "$1"` > [eval] "$opal_vers" > > -# Only print release version if it isn't 0 > -if test $$2_RELEASE_VERSION -ne 0 ; then > - > $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION" > -else > -$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION" > -fi > +$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION" > $2_VERSION="${$2_VERSION}${$2_GREEK_VERSION}" > $2_BASE_VERSION=$$2_VERSION > > Index: opal/runtime/opal_info_support.c > === > --- opal/runtime/opal_info_support.c (revision 32771) > +++ opal/runtime/opal_info_support.c (working copy) > @@ -1099,14 +1099,8 @@ > temp[BUFSIZ - 1] = '\0'; > if (0 == strcmp(scope, opal_info_ver_full) || > 0 == strcmp(scope, opal_info_ver_all)) { > -snprintf(temp, BUFSIZ - 1, "%d.%d", major, minor); > +snprintf(temp, BUFSIZ - 1, "%d.%d.%d", major, minor, release); > str = strdup(temp); > -if (release > 0) { > -snprintf(temp, BUFSIZ - 1, ".%d", release); > -asprintf(, "%s%s", str, temp); > -free(str); > -str = tmp; > -} > if (NULL != greek) { > asprintf(, "%s%s", str, greek); > free(str); > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/09/15887.php signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [OMPI devel] KNEM + user-space hybrid for sm BTL
Le 18 juil. 2013 à 11:12, "Iliev, Hristo" <il...@rz.rwth-aachen.de> a écrit : > Hello, > > Could someone, who is more familiar with the architecture of the sm BTL, > comment on the technical feasibility of the following: is it possible to > easily extend the BTL (i.e. without having to rewrite it completely from > scratch) so as to be able to perform transfers using both KNEM (or other > kernel-assisted copying mechanism) for messages over a given size and the > normal user-space mechanism for smaller messages with the switch-over point > being a user-tunable parameter? > > From what I’ve seen, both implementations have something in common, e.g. both > use FIFOs to communicate controlling information. > The motivation behind this are our efforts to become greener by extracting > the best possible out of the box performance on our systems without having to > profile each and every user application that runs on them. We’ve already > determined that activating KNEM really benefits some collective operations on > big shared-memory systems, but the increased latency significantly slows down > small message transfers, which also hits the pipelined implementations. > Hristo, The knem BTL currently available in the trunk does just this :) You can use either Knem or Linux CMA to accelerate interprocess transfers. You can use the following mca parameters to turn on knem mode: -mca btl_sm_use_knem 1 If my memory serves me well, anything under eager limit is sent by regular double copy: -mca btl_sm_eager_limit 4096 (is the default, so anything below 1 page is copy-in, copy-out). If I remember correctly, anything below 16k decreased performance. We also have a collective component leveraging on knem capabilities. If you want more info about the details, you can look at the following paper we published at IPDPS last year. It covers what we found to be the best cutoff values for using (or not) knem in several collective. Teng Ma, George Bosilca, Aurelien Bouteiller, Jack Dongarra, "HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters," Parallel and Distributed Processing Symposium, International, pp. 970-982, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012 http://www.computer.org/csdl/proceedings/ipdps/2012/4675/00/4675a970-abs.html Enjoy, Aurelien > sm’s code doesn’t seem to be very complex but still I’ve decided to ask first > before diving any deeper. > > Kind regards, > Hristo > -- > Hristo Iliev, PhD – High Performance Computing Team > RWTH Aachen University, Center for Computing and Communication > Rechen- und Kommunikationszentrum der RWTH Aachen > Seffenter Weg 23, D 52074 Aachen (Germany) > > > _______ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 309b * Knoxville, TN 37996 * 865 974 9375
Re: [OMPI devel] [EXTERNAL] Re: RFC: support for Mellanox's "libhcoll" library
If it is Mellanox specific, maybe the component name could reflect this (like mlxhcoll), as it will be visible to end-users. Aurelien Le 18 juin 2013 à 11:25, "Barrett, Brian W" <bwba...@sandia.gov> a écrit : > In general, I'm ok with it. I think we should let it soak for a week or > two in the trunk before we file the CMR to 1.7. > > Brian > > On 6/18/13 6:51 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote: > >> Sounds good; +1. >> >> On Jun 18, 2013, at 8:02 AM, Joshua Ladd <josh...@mellanox.com> wrote: >> >>> Request for Change: >>> >>> What: Add support for Mellanox Technologies¹ next-generation >>> non-blocking collectives, code-named ³libhcoll². This comes in the form >>> of a new ³hcoll² component to the ³coll² framework. >>> >>> Where: Trunk and 1.7 >>> >>> When: July 1 >>> >>> Why: In support of MPI 3, Mellanox Technologies will make available its >>> next generation collectives library, ³libhcoll², in MOFED 2.0 releases >>> and higher starting in the late 2013 timeframe. ³Libhcoll² adds support >>> for truly asynchronous non-blocking collectives on supported HCAs >>> (Connect X-3 and higher) via Mellanox Technologies¹ CORE-Direct >>> technology. ³Libhcoll² also adds support for hierarchical collectives >>> and features a highly scalable infrastructure battle tested and proven >>> on some of the world¹s largest HPC systems. >>> >>> >>> >>> >>> >>> Joshua S. Ladd, PhD >>> HPC Algorithms Engineer >>> Mellanox Technologies >>> >>> Email: josh...@mellanox.com >>> Cell: +1 (865) 258 - 8898 >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > > > -- > Brian W. Barrett > Scalable System Software Group > Sandia National Laboratories > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 309b * Knoxville, TN 37996 * 865 974 9375
Re: [OMPI devel] June OMPI developer's meeting
I will be attending. Can some local chime in and tell me how practical it is to land in San Francisco and use public transportation to go to San Jose? Plane schedule to San Jose directly is not very flexible. Aurelien Le 7 mai 2013 à 15:19, Larry Baker <ba...@usgs.gov> a écrit : > On 6 May 2013, at 11:14 AM, Jeff Squyres (jsquyres) wrote: > >> We typically do something informally scheduled on the day of, or somesuch >> (e.g., around 4pm people start wondering aloud what we should do for dinner >> :-) ). But if there is interest for others to attend, we can probably set >> up something ahead of time. > > This option will work best for me. All I need is an e-mail notice of where > and when within 30 minutes or so of the reservation time (depending on the > traffic on 101 :) ). > > Larry Baker > US Geological Survey > 650-329-5608 > ba...@usgs.gov > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 309b * Knoxville, TN 37996 * 865 974 9375
Re: [OMPI devel] enabling ft-enable-cr + vprotocol
Tiago, I have never tried to do this, I'm sorry to hear it doesn't work. I am very busy at the moment, but I'll try to upgrade the pessimist protocol in the trunk with my latest internal repo, that contains some features to mix coordinated and message logging, as soon as possible. Aurelien Le 22 juil. 2012 à 18:47, tiago essex a écrit : > hi, > > i have been playing around with the code of the pessimist protocol and i have > set it so to save some messages and some other specific information into a > few files. > > however i also need to be able to perform global checkpoint during execution. > i was wondering if it's possible to simultaneous set the mca parameters for > both the coordinated checkpoint and the vprototocol at the same time, > something like this: > > mpirun -n 10 -am ft-enable-cr -mca crs blcr -mca vprotocol pessimist prog > > i have tried, but it seems that the vprotocol does not work with ft-enable-cr > enable. is there a way to overcome this? or i'm missing something? > > > thank you ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 309b * Knoxville, TN 37996 * 865 974 9375 signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [OMPI devel] Pessimist Event Logger
Hugo, It seems you want to implement some sort of remote pessimistic logging -a la MPICH-V1- ? MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes -- George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, Anton Selikhov -- In proceedings of The IEEE/ACM SC2002 Conference, Baltimore USA, November 2002 In the PML-V, unlike older designs, the payload of messages and the non-deterministic events follow a different path. The payload of messages is logged on the sender's volatile memory, while the non-deterministic events are sent to a stable event logger, before allowing the process to impact the state of others (the code you have found in the previous email). The best depiction of this distinction can be read in this paper @inproceedings{DBLP:conf/europar/BouteillerHBD11, author= {Aurelien Bouteiller and Thomas H{\'e}rault and George Bosilca and Jack J. Dongarra}, title = {Correlated Set Coordination in Fault Tolerant Message Logging Protocols}, booktitle = {Euro-Par 2011 Parallel Processing - 17th International Conference, Proceedings, Part II}, month = {September}, year = {2011}, pages = {51-64}, publisher = {Springer}, series= {Lecture Notes in Computer Science}, volume= {6853}, year = {2011}, isbn = {978-3-642-23396-8}, doi = {http://dx.doi.org/10.1007/978-3-642-23397-5_6}, If you intend to store both payload and message log on a remote node, I suggest you look at the "sender-based" hooks, as this is where the message payload is managed, and adapt from here. The event loggers can already manage a subset only of the processes (if you launch as many EL as processes, you get a 1-1 mapping), but they never handle message payload; you'll have to add all this yourself is it so pleases you. Hope it clarifies. Aurelien Le 27 janv. 2012 à 11:19, Hugo Daniel Meyer a écrit : > Hello Aurélien. > > Thanks for the clarification. Considering what you've mentioned i will have > to make some adaptations, because to me, every single message has to be > logged. So, a sender not only will be sending messages to the receiver, but > also to an event logger. Is there any considerations that i've to take into > account when modifying the code?. My initial idea is to use the el_comm with > a group of event loggers (because every node uses a different event logger in > my approach), and then send the messages to them as you do when using > MPI_ANY_SOURCE. > > Thanks for your help. > > Hugo Meyer > > 2012/1/27 Aurélien Bouteiller <boute...@eecs.utk.edu> > Hugo, > > Your program does not have non-deterministic events. Therefore, there are no > events to log. If you add MPI_ANY_SOURCE, you should see this code being > called. Please contact me again if you need more help. > > Aurelien > > > Le 27 janv. 2012 à 10:21, Hugo Daniel Meyer a écrit : > > > Hello @ll. > > > > George, i'm using some pieces of the pessimist vprotocol. I've observed > > that when you do a send, you call vprotocol_receiver_event_flush and here > > the macro __VPROTOCOL_RECEIVER_SEND_BUFFER is called. I've noticed that > > here you try send a copy of the message to process 0 using the el_comm. > > This section of code is never executed, at least in my examples. So, the > > message is never sent to the Event Logger, am i correct with this? I think > > that this is happening because the > > mca_vprotocol_pessimist.event_buffer_length is always 0. > > > > Is there something that i've got to turn on, or i will have to modify this > > behavior manually to connect and send messages to the EL? > > > > Thanks in advance. > > > > Hugo Meyer > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > -- > * Dr. Aurélien Bouteiller > * Researcher at Innovative Computing Laboratory > * University of Tennessee > * 1122 Volunteer Boulevard, suite 350 > * Knoxville, TN 37996 > * 865 974 6321 > > > > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321 signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [OMPI devel] Pessimist Event Logger
Hugo, Your program does not have non-deterministic events. Therefore, there are no events to log. If you add MPI_ANY_SOURCE, you should see this code being called. Please contact me again if you need more help. Aurelien Le 27 janv. 2012 à 10:21, Hugo Daniel Meyer a écrit : > Hello @ll. > > George, i'm using some pieces of the pessimist vprotocol. I've observed that > when you do a send, you call vprotocol_receiver_event_flush and here the > macro __VPROTOCOL_RECEIVER_SEND_BUFFER is called. I've noticed that here you > try send a copy of the message to process 0 using the el_comm. This section > of code is never executed, at least in my examples. So, the message is never > sent to the Event Logger, am i correct with this? I think that this is > happening because the mca_vprotocol_pessimist.event_buffer_length is always 0. > > Is there something that i've got to turn on, or i will have to modify this > behavior manually to connect and send messages to the EL? > > Thanks in advance. > > Hugo Meyer > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- * Dr. Aurélien Bouteiller * Researcher at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321 signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [OMPI devel] [OMPI svn] svn:open-mpi r23931
Ralph, In file included from ../../../../../trunk/opal/mca/event/libevent207/libevent207_module.c:44: ../../../../../trunk/opal/mca/event/libevent207/libevent/event.h:165:33: error: event2/event-config.h: No such file or directory Looks like you forgot some files. Aurelien Le 25 oct. 2010 à 10:53, r...@osl.iu.edu a écrit : > Author: rhc > Date: 2010-10-25 10:53:33 EDT (Mon, 25 Oct 2010) > New Revision: 23931 > URL: https://svn.open-mpi.org/trac/ompi/changeset/23931 > > Log: > Remove the sample and test code from the libevent distro - don't need to > include them in ompi > > Removed: > trunk/opal/mca/event/libevent207/libevent/sample/ > trunk/opal/mca/event/libevent207/libevent/test/ > Text files modified: > trunk/opal/mca/event/libevent207/libevent/Makefile.am | 2 +- > > trunk/opal/mca/event/libevent207/libevent/configure.in | 2 +- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > Modified: trunk/opal/mca/event/libevent207/libevent/Makefile.am > == > --- trunk/opal/mca/event/libevent207/libevent/Makefile.am (original) > +++ trunk/opal/mca/event/libevent207/libevent/Makefile.am 2010-10-25 > 10:53:33 EDT (Mon, 25 Oct 2010) > @@ -85,7 +85,7 @@ > libevent.pc.in \ > Doxyfile \ > whatsnew-2.0.txt \ > - Makefile.nmake test/Makefile.nmake \ > + Makefile.nmake \ > $(PLATFORM_DEPENDENT_SRC) > > # OMPI: Changed to noinst and libevent.la > > Modified: trunk/opal/mca/event/libevent207/libevent/configure.in > == > --- trunk/opal/mca/event/libevent207/libevent/configure.in(original) > +++ trunk/opal/mca/event/libevent207/libevent/configure.in2010-10-25 > 10:53:33 EDT (Mon, 25 Oct 2010) > @@ -838,4 +838,4 @@ > fi > > AC_CONFIG_FILES( [libevent.pc libevent_openssl.pc libevent_pthreads.pc] ) > -AC_OUTPUT(Makefile include/Makefile test/Makefile sample/Makefile) > +AC_OUTPUT(Makefile include/Makefile) > ___ > svn mailing list > s...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/svn
[OMPI devel] orte does not compile on XT5 (pgcc)
Here is the problem. The PGI compiler is especially paranoid regarding post declared structures typedefs. It looks like the include ordering makes the nidmap.h file being included before orte_jmap_t typedefs and siblings have been done. /opt/cray/xt-asyncpe/4.0/bin/cc: INFO: linux target is being used PGC-S-0040-Illegal use of symbol, orte_jmap_t (../../../../../trunk/orte/util/nidmap.h: 47) PGC-W-0156-Type not specified, 'int' assumed (../../../../../trunk/orte/util/nidmap.h: 47) PGC-S-0040-Illegal use of symbol, orte_pmap_t (../../../../../trunk/orte/util/nidmap.h: 48) PGC-W-0156-Type not specified, 'int' assumed (../../../../../trunk/orte/util/nidmap.h: 48) PGC-S-0040-Illegal use of symbol, orte_nid_t (../../../../../trunk/orte/util/nidmap.h: 49) PGC-W-0156-Type not specified, 'int' assumed (../../../../../trunk/orte/util/nidmap.h: 49) PGC-S-0040-Illegal use of symbol, orte_jmap_t (../../../../../trunk/orte/util/nidmap.h: 63) PGC-W-0156-Type not specified, 'int' assumed (../../../../../trunk/orte/util/nidmap.h: 63) PGC-S-0074-Non-constant expression in initializer (../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 95) PGC-S-0074-Non-constant expression in initializer (../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 103) PGC-W-0093-Type cast required for this conversion of constant (../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 109) PGC-W-0093-Type cast required for this conversion of constant (../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 109) PGC/x86-64 Linux 10.5-0: compilation completed with severe errors Aurelien
Re: [OMPI devel] Autogen.pl, romio and autoconf 2.66
Le 28 sept. 2010 à 18:10, Aurélien Bouteiller a écrit : > > Le 28 sept. 2010 à 17:55, Jeff Squyres a écrit : > >> On Sep 28, 2010, at 5:30 PM, Aurélien Bouteiller wrote: >> >>> Hi there, >>> >>> has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when >>> configuring romio with the following error: >>> === Processing subdir: >>> /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio >>> --- Found configure.in|ac; running autoreconf... >>> autoreconf: Entering directory `.' >>> autoreconf: configure.in: not using Gettext >>> autoreconf: running: aclocal --force >>> configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library >> >> This looks like Libtool or Automake isn't installed properly...? You were right on that one. The system provided automake on Kraken is broken. Fixed by installing my own. >> > That's a possibility, but one problem at a time :) >> >> >>> configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments >>> ../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from... > Apparently, after making some internet search, it looks like autoconf 2.66 is > plain broken. I'll try with another one and report on this issue. > Confirmed. Autoconf 2.66 cannot compile romio, 2.68 and 2.65 can, no problem. > Aurelien > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Autogen.pl, romio and autoconf 2.66
Le 28 sept. 2010 à 17:55, Jeff Squyres a écrit : > On Sep 28, 2010, at 5:30 PM, Aurélien Bouteiller wrote: > >> Hi there, >> >> has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when >> configuring romio with the following error: >> === Processing subdir: >> /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio >> --- Found configure.in|ac; running autoreconf... >> autoreconf: Entering directory `.' >> autoreconf: configure.in: not using Gettext >> autoreconf: running: aclocal --force >> configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library > > This looks like Libtool or Automake isn't installed properly...? > That's a possibility, but one problem at a time :) > > >> configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments >> ../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from... Apparently, after making some internet search, it looks like autoconf 2.66 is plain broken. I'll try with another one and report on this issue. Aurelien
[OMPI devel] Autogen.pl, romio and autoconf 2.66
Hi there, has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when configuring romio with the following error: === Processing subdir: /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio --- Found configure.in|ac; running autoreconf... autoreconf: Entering directory `.' autoreconf: configure.in: not using Gettext autoreconf: running: aclocal --force configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments ../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from... configure.in:791: the top level autom4te: /sw/xt/autoconf/2.66/cnl2.2_gnu4.4.4/bin/m4 failed with exit status: 1 aclocal: /sw/xt/autoconf/2.66/cnl2.2_gnu4.4.4/bin/autom4te failed with exit status: 1 autoreconf: aclocal failed with exit status: 1 Command failed: autoreconf -ivf Should I demote my autoconf to 2.65 ? Thanks, Aurelien
Re: [OMPI devel] what's the relationship between proc, endpoint and btl?
btl is the component responsible for a particular type of fabric. Endpoint is somewhat the instantiation of a btl to reach a particular destination on a particular fabric, proc is the generic name and properties of a destination. Aurelien Le 24 févr. 2010 à 09:59, hu yaohui a écrit : > Could someone tell me the relationship between proc,endpoint and btl? > thanks & regards > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Vprotocol pessimist - Open MPI 1.4.1 and 1.4.2a1r22558
Hi, The instructions you found are now obsolete. I'll update them, thank you for pointing out. The new procedure to use uncoordinated checkpoint is now mpirun -mca vprotocol pessimist -mca pml ob1,v [regular arguments]. The version available in trunk does not support actual restart due to lack of runtime support, and is limited to performance evaluation of FT cost without failures. There is an ongoing proposal to include such support in the main branch. However, we do have a branched version of Open MPI including all the necessary support that I can be provided on request. Please also consider that this is an ongoing research effort that has not yet matured enough to be used in a production environment. Aurelien Bouteiller -- Dr. Aurelien Bouteiller Innovative Computing Laboratory at the University of Tennessee Le 6 févr. 2010 à 10:21, Caciano Machado a écrit : > Hi, > > I'm following the instructions found at > https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR to run an > application with the vprotocol pessimist enabled. I believe that I'm > doing something wrong but I can't figure out the problem. > > I have compiled Open MPI 1.4.1 and 1.4.2a1r22558 with the parameters: > ./configure --prefix=/usr/local/openmpi-v/ --with-ft=cr > --with-blcr=/usr/local/blcr/ > > Here is my configuration file: > vprotocol_pessimist_priority=10 > pml_base_verbose=10 > pbl_v_verbose=500 > > The command line: > mpirun -am /etc/v -np 2 -machinefile /etc/machinefile ep.B.8 > > And the mpirun output: > ##3 > [xiru-10:03440] mca: base: components_open: Looking for pml components > [xiru-10:03440] mca: base: components_open: opening pml components > [xiru-10:03440] mca: base: components_open: found loaded component cm > [xiru-10:03440] mca: base: components_open: component cm has no > register function > [xiru-10:03440] mca: base: component_find: unable to open > /usr/local/openmpi-v/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, > or compiled for a different version of Open MPI? (ignored) > > [xiru-10:03440] mca: base: components_open: component cm open function > successful > [xiru-10:03440] mca: base: components_open: found loaded component crcpw > [xiru-10:03440] mca: base: components_open: component crcpw has no > register function > [xiru-10:03440] mca: base: components_open: component crcpw open > function successful > [xiru-10:03440] mca: base: components_open: found loaded component csum > [xiru-10:03440] mca: base: components_open: component csum has no > register function > [xiru-10:03440] mca: base: component_find: unable to open > /usr/local/openmpi-v/lib/openmpi/mca_btl_mx: perhaps a missing symbol, > or compiled for a different version of Open MPI? (ignored) > [xiru-10:03440] mca: base: components_open: component csum open > function successful > [xiru-10:03440] mca: base: components_open: found loaded component ob1 > [xiru-10:03440] mca: base: components_open: component ob1 has no > register function > [xiru-10:03440] mca: base: components_open: component ob1 open > function successful > [xiru-10:03440] mca: base: components_open: found loaded component v > [xiru-10:03440] mca: base: components_open: component v has no register > function > [xiru-10:03440] mca: base: components_open: component v open function > successful > -- > [[65326,1],0]: A high-performance Open MPI point-to-point messaging module > was unable to find any relevant network interfaces: > > Module: OpenFabrics (openib) > Host: xiru-10.portoalegre.grenoble.grid5000.fr > > Another transport will be used instead, although this may result in > lower performance. > -- > [xiru-10:03440] select: initializing pml component cm > [xiru-10:03440] select: init returned failure for component cm > [xiru-10:03440] select: component crcpw not in the include list > [xiru-10:03440] select: component csum not in the include list > [xiru-10:03440] select: initializing pml component ob1 > [xiru-10:03440] select: init returned priority 20 > [xiru-10:03440] select: component v not in the include list > [xiru-10:03440] selected ob1 best priority 20 > [xiru-10:03440] select: component ob1 selected > [xiru-10:03440] mca: base: close: component cm closed > [xiru-10:03440] mca: base: close: unloading component cm > [xiru-10:03440] mca: base: close: component crcpw closed > [xiru-10:03440] mca: base: close: unloading component crcpw > [xiru-10:03440] mca: base: close: component csum closed > [xiru-10:03440] mca: base: close: unloading component csum > [xiru-10:03440] mca: base: close: component v closed > [xiru-10:03440] mca: base: close: unloading component v > ... > > #3 > > It seems that the vprotocol module is not loading properly. Does > anyone have a solution to
[OMPI devel] MCA component dependency
Hi everyone, I'm trying to state that a particular component depends on another that should therefore be dlopened automatically when it is loaded. I found some code doing exactly that in mca_base_component_find:open_component, but can't find any example of how to actually trigger that code path. Does anybody have a clue of what should I do to declare the list of dependencies of my component ? Thanks, Aurelien
Re: [OMPI devel] [OMPI svn] svn:open-mpi r20196
Addendum to the previous message concerning this discussion: I think we should stick with including opal_stdint everywhere instead of inttypes.h (this file does not always exist on ansi pedantic compilers). Aurelien Le 4 janv. 09 à 00:09, timat...@osl.iu.edu a écrit : Author: timattox Date: 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009) New Revision: 20196 URL: https://svn.open-mpi.org/trac/ompi/changeset/20196 Log: Refs #868, #869 The fix for #868, r14358, introduced an (unneeded?) inconsitency... For Mac OS X systems, inttypes.h will always be included with opal_config.h, and NOT included for non-Mac OS X systems. For developers using Mac OS X, this masks the need to include inttypes.h or more properly opal_stdint.h. This changeset corrects one of these oopses. However, the underlying problem still exists. Moving the equivelent of r14358 into opal_stdint.h from opal_config_bottom.h might be the "right" solution, but AFAIK, we would then need to replace each direct inclusion of inttypes.h with opal_stdint.h to properly address tickets #868 and #869. Text files modified: trunk/opal/dss/dss_print.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) Modified: trunk/opal/dss/dss_print.c = = = = = = = = == --- trunk/opal/dss/dss_print.c (original) +++ trunk/opal/dss/dss_print.c 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009) @@ -18,6 +18,7 @@ #include "opal_config.h" +#include "opal_stdint.h" #include #include "opal/dss/dss_internal.h" ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn
Re: [OMPI devel] [OMPI svn] svn:open-mpi r20196
Tim, To answer to your question in ticket #869: the only known missing feature to the opal_stdint.h is that there is no portable way to printf size_t. Their type is subject to so many changes depending on the platform and compiler that it is impossible to be sure that PRI_size_t is not gonna dump a lot of warnings. Aside from that, it should be pretty solid. Aurelien Le 4 janv. 09 à 00:09, timat...@osl.iu.edu a écrit : Author: timattox Date: 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009) New Revision: 20196 URL: https://svn.open-mpi.org/trac/ompi/changeset/20196 Log: Refs #868, #869 The fix for #868, r14358, introduced an (unneeded?) inconsitency... For Mac OS X systems, inttypes.h will always be included with opal_config.h, and NOT included for non-Mac OS X systems. For developers using Mac OS X, this masks the need to include inttypes.h or more properly opal_stdint.h. This changeset corrects one of these oopses. However, the underlying problem still exists. Moving the equivelent of r14358 into opal_stdint.h from opal_config_bottom.h might be the "right" solution, but AFAIK, we would then need to replace each direct inclusion of inttypes.h with opal_stdint.h to properly address tickets #868 and #869. Text files modified: trunk/opal/dss/dss_print.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) Modified: trunk/opal/dss/dss_print.c = = = = = = = = == --- trunk/opal/dss/dss_print.c (original) +++ trunk/opal/dss/dss_print.c 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009) @@ -18,6 +18,7 @@ #include "opal_config.h" +#include "opal_stdint.h" #include #include "opal/dss/dss_internal.h" ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn
Re: [OMPI devel] Should visibility and memchecker abort configure?
Hi Ralph, 1. No. Having visibility turned off without knowing it is the best way for us to commit bugs in the trunk without noticing, I mean before somebody else get the leg caught in the "not-compiling-trunk trap". I had more of my share of responsibility for that kind of problems in the past, that exactly rooted in visibility issues. I must say that it is painful enough that some compilers will just silently ignore visibility settings without adding the configure to the chain of guys who just do whatever they want regardless of the requested flags. If I can't have visibility, I want to know. Especially in debug mode. 2. If Valgrind is not available and this feature requires valgrind, it is reasonable to disable it. Anyway, this would not lead to include silent bugs in the trunk if it gets disabled "silently". (are you sure though ? I used to enable this on my mac, where there is of course no valid valgrind installed, and it compiled just fine). Aurelien Le 2 oct. 08 à 18:04, Ralph Castain a écrit : Hi folks I make heavy use of platform files to provide OMPI support for the three NNSA labs. This means supporting multiple compilers, several different hardware and software configs, debug vs optimized, etc. Recently, I have encountered a problem that is making life difficult. The problem revolves around two configure options that apply to debug builds: 1. --enable-visibility. Frustrating as it may be, some compilers just don't support visibility - and others only support it for versions above a specific level. Currently, this option will abort the configure procedure if the compiler does not support visibility. 2. --enable-memchecker. This framework has a component that requires valgrind 3.2 or above. Unfortunately, if a valgrind meeting that criteria is not found, this option will also abort the configure procedure. Is it truly -necessary- for these options to abort configure in these conditions? Would it be acceptable for: * visibility just to print a big warning, surrounded by asterisks, that the selected compiler does not support visibility - but allow the build to continue? * memchecker to also print a big warning, surrounded by asterisks, explaining the valgrind requirement and turn "off" the build of the memchecker/valgrind component - but allow the build to continue? It would seem to me that we would certainly want this for the future anyway as additional memchecker components are supported. If this would be acceptable, I am happy to help with or implement the changes. It would be greatly appreciated. Thanks Ralph ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r19653
we need to extend the existing RML function to handle the subsequent setting of the route to the proc itself. In the current dpm, we automatically assume that the route will be to a different job family, and hence send the routing info to the HNP. However, this may not be true - e.g., after a comm_spawn, there is no reason to route through the HNP since the job family is the same. This is not correct. The current code in the DPM already takes care of the "usual" case where both ends are in the same job family; in that case it creates a "direct" route to the remote end (maybe it should just do nothing, though). This logic is pretty simple and is well contained in the DPM. Moving this logic to the rml should not basically change much: the complexity will just move from the dpm to the routed. The existing single dpm code already do everything we need for current and future use, while we might have to upgrade all the routed to take into account this special case. This is why I would advocate for the lesser effort for the exact same functionality at the end. Haven't thought it all through yet, but wanted to suggest we think about it as we may (per the FT July discussions) need to define routes for things other than just DPM-related operations. Perhaps we should do some design discussion off-list to see what makes sense? I'm always open to discussion. Let me know if you find this useful on some purpose. Thanks Ralph Aurelien On Sep 28, 2008, at 8:33 AM, Aurélien Bouteiller wrote: Ralph, I just split the existing static function from inside the dpm and exposed it to the outside world. The idea is that the dpm create the (opaque) port strings and therefore nows how they are supposed to be formated. So he is responsible for parsing them. Second, I split the parsing and routing in two different functions because sometimes you might want to parse without creating a route to the target. I'll check the RML function to see if it offers similar functionality om monday. I have no strongly religious belief on wether this should be a rml or dpm function. So I don't care as long as I have what I need :] Thanks for the feedback, Aurelien Le 27 sept. 08 à 20:53, Ralph Castain a écrit : Yo Aurelien Regarding the dpm including a "route_to_port" API. This actually is pretty close to being an exact duplicate of an already existing function in the RML that takes a URI as it's input, parses it to separate the proc name and the contact info, sets the contact info into the OOB, sets the route to that proc, and returns the proc name to the caller. Take a look at orte/mca/rml/base/ rml_base_contact.c. All we need to do is add the logic to that function so that, if the target proc is not in our job family, we update the route and contact info in the HNP instead of locally. This would keep all the "setting_route_to_proc" functionality in one place, instead of duplicating it in the dpm, thus making maintenance much easier. Make sense? Ralph On Sep 27, 2008, at 7:22 AM, boute...@osl.iu.edu wrote: Author: bouteill Date: 2008-09-27 09:22:32 EDT (Sat, 27 Sep 2008) New Revision: 19653 URL: https://svn.open-mpi.org/trac/ompi/changeset/19653 Log: Add functions to access the opaque port_string and to add routes to a remote port. This is usefull for FT, but could also turn usefull when considering MPI3 extentions to the MPI2 dynamics. Text files modified: trunk/ompi/mca/dpm/base/base.h | 3 + trunk/ompi/mca/dpm/base/dpm_base_null_fns.c |12 trunk/ompi/mca/dpm/base/dpm_base_open.c | 2 trunk/ompi/mca/dpm/dpm.h|20 +++ trunk/ompi/mca/dpm/orte/dpm_orte.c | 114 ++ +++-- 5 files changed, 99 insertions(+), 52 deletions(-) Modified: trunk/ompi/mca/dpm/base/base.h = = = = = = = = = = = === --- trunk/ompi/mca/dpm/base/base.h (original) +++ trunk/ompi/mca/dpm/base/base.h 2008-09-27 09:22:32 EDT (Sat, 27 Sep 2008) @@ -92,6 +92,9 @@ int ompi_dpm_base_null_dyn_finalize (void); void ompi_dpm_base_null_mark_dyncomm (ompi_communicator_t *comm); int ompi_dpm_base_null_open_port(char *port_name, orte_rml_tag_t given_tag); +int ompi_dpm_base_null_parse_port(char *port_name, + orte_process_name_t *rproc, orte_rml_tag_t *tag); +int ompi_dpm_base_null_route_to_port(char *rml_uri, orte_process_name_t *rproc); int ompi_dpm_base_null_close_port(char *port_name); /* useful globals */ Modified: trunk/ompi/mca/dpm/base/dpm_base_null_fns.c = = = = = = = = = = = === --- trunk/ompi/mca/dpm/base/dpm_base_null_fns.c (original) +++ trunk/ompi/mca/dpm/base/dpm_base_null_fns.c 2008-09-27 09:22:32 EDT (Sat, 2
Re: [OMPI devel] trunk temporarily closed
Any idea of a timeframe for the problem to get fixed ? Aurelien Le 25 sept. 08 à 14:03, Jeff Squyres a écrit : On Sep 25, 2008, at 1:44 PM, Jeff Squyres (jsquyres) wrote: The SVN trunk has been temporarily closed due to what may be an accidental commit. The entire OMPI SVN is now offline (vs. just the trunk). -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] gdb libmpi.dylib on Leopard
I filed the following bug report on Apple Developer Connexion. As a short summary, I suggest they get in touch with us and include the -- whole-archive mechanism in their ld. Aurelien 19-Sep-2008 03:08 PM Aurelien Bouteiller: Summary: Because the Apple ld does not include the GNU's ld --whole-archive/-- no-whole-archive mechanism to allow loading of all members of selective archives, libtool (including gnu libtool) is forced to unpack all the members of a convenience library (and later delete them), and afterwards needs to run dsymutil. Unfortunately, because the archives are uncompressed to a temporary space before being included in the final library, dsymutil seams to get confused. As a consequence, it is impossible to debug a library with gdb, the .o files never being found, even if the library actually contains all the necessary debug symbols. Steps to reproduce: 1. Download a svn Open MPI trunk release (or any libtool based project, I've experienced the same problems when compiling my own gcc4.3). Please note that you need autoconf 2.62 and automake 1.10 to compile Open MPI trunk. 2. configure Open MPI with the debug options (configure --enable-debug) 3. make install 4. find or create a sample MPI program, mpicc it. 5. mpirun -np 1 gdb mpi_sample_program 6. break MPI_Init, r, n. Expected results: 6: you should step each line of the MPI_Init function Actual results: 6. you see a large number of warnings warning: Could not find object file "/Users/bouteill/ompi/debug.build/ opal/.libs/libopen-pal.lax/libmca_memchecker.a/memchecker_base_open.o" - no debug information available for "../../../../trunk/opal/mca/ memchecker/base/memchecker_base_open.c". You are unable to step in MPI_Init. Instead the execution continues up to reach the "main" function. Regression: Used to work with Tiger. Notes: If you need some more details or want to cooperate with us, please register to the Open MPI devel mailing list. As a major open source project we have been working on a fix for this issue for a while, but where unable to correct it without modifications to apple's ld. We believe that the best workaround would be to include the --whole- archive/--no-whole-archive mechanism. Then there is no need anymore to unpack the convenience archives before building the .dylib, and as a friendly side effect compilation time should improve a lot. Thanks, -- * Dr. Aurélien Bouteiller * Sr. Research Associate at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321 (on behalf of the Open MPI development community) Le 19 sept. 08 à 17:22, Jeff Squyres a écrit : Thanks for following up! Aurelien, I'll leave this to you -- I rarely do OMPI development on my Mac... On Sep 19, 2008, at 5:08 PM, Ralf Wildenhues wrote: Hello, I asked Peter O'Gorman about this issue, and he said | I believe that running dsymutil on the generated lib would then create a | libfoo.dSYM in the .libs directory conatining all the necessary | debugging information, which could be used for debugging the library in | the build tree (gdb should find it sitting there next to the original | library and use the debug information in the .dSYM). Libtool-2.2.6 does | run dsymutil and create the .dSYM though... | | There should be a libmpi.dylib in a .libs directory and a | libmpi.dylib.dSYM directory next to it. Also, he said that it could help if you reported a bug at <http://bugreporter.apple.com>, under the notion that the more people file bugs with them, the more they will understand what problems users have with the dsymutils issues. Cheers, Ralf * Aurélien Bouteiller wrote on Fri, Sep 19, 2008 at 09:44:46PM CEST: Ok, I didn't forgot to rerun autogen.sh (I even erased the libltdl, and various libtool wrappers that are generated at autogen/configure time). I checked the link Ralf submitted to our attention. This is exactly the same problem, or at least the same symptoms. The last version of libtool runs dsymutil on the created .so/.dylib, but the bad thing is that dsymutil returns similar warning message about missing ".lax" files. Therefore, even running it manually on the .dsym does not help. I upgraded (compiled my own copy) my gcc to 4.3.2 (you should do it too, Jeff, the experimental have been giving me headaches in the past). Now, I also have the same warning messages for internal libs of gcc than for open MPI. This leads me to believe this is not an Open MPI bug, but more probably a libtool/ld issue. I'll switch to linux for my devel for now, but if you have any success story... Aurelien Le 19 sept. 08 à 15:20, Jeff Squyres a écrit : I get the same problem on my MBP with 10.5.5. However, I'm running the gcc from hpc.sf.net: - [15:16] rtp-jsquyres-8713:~/mpi % gcc --version gcc (GCC) 4.3.0 20071026 (experiment
Re: [OMPI devel] gdb libmpi.dylib on Leopard
Ok, I didn't forgot to rerun autogen.sh (I even erased the libltdl, and various libtool wrappers that are generated at autogen/configure time). I checked the link Ralf submitted to our attention. This is exactly the same problem, or at least the same symptoms. The last version of libtool runs dsymutil on the created .so/.dylib, but the bad thing is that dsymutil returns similar warning message about missing ".lax" files. Therefore, even running it manually on the .dsym does not help. I upgraded (compiled my own copy) my gcc to 4.3.2 (you should do it too, Jeff, the experimental have been giving me headaches in the past). Now, I also have the same warning messages for internal libs of gcc than for open MPI. This leads me to believe this is not an Open MPI bug, but more probably a libtool/ld issue. I'll switch to linux for my devel for now, but if you have any success story... Aurelien Le 19 sept. 08 à 15:20, Jeff Squyres a écrit : I get the same problem on my MBP with 10.5.5. However, I'm running the gcc from hpc.sf.net: - [15:16] rtp-jsquyres-8713:~/mpi % gcc --version gcc (GCC) 4.3.0 20071026 (experimental) ... - Not the /usr/bin/gcc that ships with Leopard. I don't know if that matters or not. I'm using AC 2.63, AM 1.10.1, LT 2.2.6a with a fairly vanilla build of Open MPI: ./configure --prefix=/Users/jsquyres/bogus --disable-mpi-f77 -- enable-mpirun-prefix-by-default Here's what happens -- I fire up an MPI program and it deadlocks. I attach to an MPI process PID with gdb (I am using /usr/bin/gdb -- the Leopard-shipped gdb). I get oodles of messages like Aurelien's: - warning: Could not find object file "/data/jsquyres/svn/ompi/ ompi/.libs/libmpi.lax/libdatatype.a/convertor.o" - no debug information available for "convertor.c". warning: Could not find object file "/data/jsquyres/svn/ompi/ ompi/.libs/libmpi.lax/libdatatype.a/copy_functions.o" - no debug information available for "copy_functions.c". warning: Could not find object file "/data/jsquyres/svn/ompi/ ompi/.libs/libmpi.lax/libdatatype.a/copy_functions_heterogeneous.o" - no debug information available for "copy_functions_heterogeneous.c". ----- On Sep 19, 2008, at 2:31 PM, Ralf Wildenhues wrote: * Aurélien Bouteiller wrote on Fri, Sep 19, 2008 at 08:02:40PM CEST: Thanks Ralf for the support. I upgraded to libtool 2.2.6 and it didn't solved the problem though. Still looking for somebody to confirm that its working or not working on their Mac. Did you rerun autogen.sh? All I know is that your report looks really similar to <http://gcc.gnu.org/ml/gcc/2008-08/msg00054.html> and that one is apparently solved with Libtool 2.2.6. If yours is still broken, then some more details would be nice. Cheers, Ralf ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] gdb libmpi.dylib on Leopard
Thanks Ralf for the support. I upgraded to libtool 2.2.6 and it didn't solved the problem though. Still looking for somebody to confirm that its working or not working on their Mac. Aurelien Le 17 sept. 08 à 12:39, Ralf Wildenhues a écrit : Hello Aurélien, * Aurélien Bouteiller wrote on Wed, Sep 17, 2008 at 06:32:11PM CEST: I have been facing a weird problem for several month now (I guess since I upgraded from Tiger to Leopard). I am unable to debug Open MPI using gdb on my mac. The problem comes from gdb not being able to load symbols from the dynamic libraries of Open MPI. I receive a message "warning: Could not find object file "/Users/bouteill/ompi/debug.build/ opal/.libs/libopen-pal.lax/libmca_memory.a/memory_base_close.o" - no debug information available for "../../../../trunk/opal/mca/memory/ base/memory_base_close.c".". As you can see, the path to the object file containing the symbols is not correct. It links to the temporary files expanded during the final stage link. As those files do not exist anymore, gdb gets confused. I have a vague memory that this is fixed in Libtool 2.2.6. If you're using an older version, please retry bootstrapping OpenMPI with that one. Cheers, Ralf ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] gdb libmpi.dylib on Leopard
I have been facing a weird problem for several month now (I guess since I upgraded from Tiger to Leopard). I am unable to debug Open MPI using gdb on my mac. The problem comes from gdb not being able to load symbols from the dynamic libraries of Open MPI. I receive a message "warning: Could not find object file "/Users/bouteill/ompi/debug.build/ opal/.libs/libopen-pal.lax/libmca_memory.a/memory_base_close.o" - no debug information available for "../../../../trunk/opal/mca/memory/ base/memory_base_close.c".". As you can see, the path to the object file containing the symbols is not correct. It links to the temporary files expanded during the final stage link. As those files do not exist anymore, gdb gets confused. supposedly, the rpath option of libtool should take care of this and correct the path to the symbols. Is anybody successful at debugging Open MPI on Leopard ? Is this a bug of Open MPI or a bug in libtool/ gdb ? Any known fix ? Aurelien -- * Dr. Aurélien Bouteiller * Sr. Research Associate at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321
Re: [OMPI devel] PLM consistency: priority
We don't want the user to have to select by hand the best PML. The logic inside the current selection process selects the best pml for the underlying network. However changing the priority is pretty meaningless from the user's point of view. So while retaining the selection process including priorities, we might want to remove the priority parameter, and use only the pml=ob1,cm syntax from the user's point of view. Aurelien Le 11 juil. 08 à 10:56, Ralph H Castain a écrit : Okay, another fun one. Some of the PLM modules use MCA params to adjust their relative selection priority. This can lead to very unexpected behavior as which module gets selected will depend on the priorities of the other selectable modules - which changes from release to release as people independently make adjustments and/or new modules are added. Fortunately, this doesn't bite us too often since many environments only support one module, and since there is nothing to tell the user that the plm module whose priority they raised actually -didn't- get used! However, in the interest of "least astonishment", some of us working on the RTE had changed our coding approach to avoid this confusion. Given that we have this nice mca component select logic that takes the specified module - i.e., "-mca plm foo" always yields foo if it can run, or errors out if it can't - then the safest course is to remove MCA params that adjust module priorities and have the user simply tell us which module they want us to use. Do we want to make this consistent, at least in the PLM? Or do you want to leave the user guessing? :-) Ralph ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r18804
Thanks Ralph, this fix does the trick. Aurelien Le 3 juil. 08 à 13:53, r...@osl.iu.edu a écrit : Author: rhc Date: 2008-07-03 13:53:37 EDT (Thu, 03 Jul 2008) New Revision: 18804 URL: https://svn.open-mpi.org/trac/ompi/changeset/18804 Log: Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/ accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 Added: trunk/orte/mca/ess/base/ess_base_nidmap.c Removed: trunk/orte/mca/ess/base/ess_base_build_nidmap.c Text files modified: trunk/ompi/attribute/attribute_predefined.c |13 trunk/ompi/mca/dpm/base/base.h | 1 trunk/ompi/mca/dpm/base/dpm_base_null_fns.c | 5 trunk/ompi/mca/dpm/base/dpm_base_open.c | 1 trunk/ompi/mca/dpm/dpm.h| 7 trunk/ompi/mca/dpm/orte/dpm_orte.c | 494 +++ +++- trunk/ompi/mca/pubsub/orte/pubsub_orte.c|14 trunk/ompi/proc/proc.c | 1 trunk/orte/mca/ess/alps/ess_alps_module.c | 163 + trunk/orte/mca/ess/base/Makefile.am | 2 trunk/orte/mca/ess/base/base.h |12 trunk/orte/mca/ess/base/ess_base_get.c | 9 trunk/orte/mca/ess/base/ess_base_put.c | 8 trunk/orte/mca/ess/env/ess_env_module.c | 144 +-- trunk/orte/mca/ess/hnp/ess_hnp_module.c | 2 trunk/orte/mca/ess/lsf/ess_lsf_module.c | 138 +- trunk/orte/mca/ess/singleton/ess_singleton_module.c | 182 ++ +-- trunk/orte/mca/ess/slurm/ess_slurm_module.c | 136 +- trunk/orte/mca/ess/tool/ess_tool_module.c | 2 trunk/orte/mca/grpcomm/bad/grpcomm_bad_module.c |22 + trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c|13 trunk/orte/mca/odls/base/odls_base_default_fns.c|52 ++-- trunk/orte/mca/odls/base/odls_base_open.c | 8 trunk/orte/mca/odls/base/odls_private.h | 4 trunk/orte/mca/rml/base/rml_base_receive.c |21 + trunk/orte/mca/rml/rml_types.h | 2 trunk/orte/mca/routed/binomial/routed_binomial.c| 192 +++ ++-- trunk/orte/mca/routed/direct/routed_direct.c| 316 +++ +++-- trunk/orte/mca/routed/linear/routed_linear.c| 198 +++ ++-- trunk/orte/runtime/orte_globals.h |15 +
Re: [OMPI devel] PML selection logic
The first approach sounds fair enough to me. We should avoid 2 and 3 as the pml selection mechanism used to be more complex before we reduced it to accommodate a major design bug in the BTL selection process. When using the complete PML selection, BTL would be initialized several times, leading to a variety of bugs. Eventually the PML selection should return to its old self, when the BTL bug gets fixed. Aurelien Le 23 juin 08 à 12:36, Ralph H Castain a écrit : Yo all I've been doing further research into the modex and came across something I don't fully understand. It seems we have each process insert into the modex the name of the PML module that it selected. Once the modex has exchanged that info, it then loops across all procs in the job to check their selection, and aborts if any proc picked a different PML module. All well and good...assuming that procs actually -can- choose different PML modules and hence create an "abort" scenario. However, if I look inside the PML's at their selection logic, I find that a proc can ONLY pick a module other than ob1 if: 1. the user specifies the module to use via -mca pml xyz or by using a module specific mca param to adjust its priority. In this case, since the mca param is propagated, ALL procs have no choice but to pick that same module, so that can't cause us to abort (we will have already returned an error and aborted if the specified module can't run). 2. the pml/cm module detects that an MTL module was selected, and that it is other than "psm". In this case, the CM module will be selected because its default priority is higher than that of OB1. In looking deeper into the MTL selection logic, it appears to me that you either have the required capability or you don't. I can see that in some environments (e.g., rsh across unmanaged collections of machines), it might be possible for someone to launch across a set of machines where some do and some don't have the required support. However, in all other cases, this will be homogeneous across the system. Given this analysis (and someone more familiar with the PML should feel free to confirm or correct it), it seems to me that this could be streamlined via one or more means: 1. at the most, we could have rank=0 add the PML module name to the modex, and other procs simply check it against their own and return an error if they differ. This accomplishes the identical functionality to what we have today, but with much less info in the modex. 2. we could eliminate this info from the modex altogether by requiring the user to specify the PML module if they want something other than the default OB1. In this case, there can be no confusion over what each proc is to use. The CM module will attempt to init the MTL - if it cannot do so, then the job will return the correct error and tell the user that CM/MTL support is unavailable. 3. we could again eliminate the info by not inserting it into the modex if (a) the default PML module is selected, or (b) the user specified the PML module to be used. In the first case, each proc can simply check to see if they picked the default - if not, then we can insert the info to indicate the difference. Thus, in the "standard" case, no info will be inserted. In the second case, we will already get an error if the specified PML module could not be used. Hence, the modex check provides no additional info or value. I understand the motivation to support automation. However, in this case, the automation actually doesn't seem to buy us very much, and it isn't coming "free". So perhaps some change in how this is done would be in order? Ralph ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] "__printf__" attribute
They refer to the parameters of the function. In the example linked, 2 means the fmt is the second argument of the function and 3 is the first variadic arg related to the fmt string. Aurelien Le 8 mai 08 à 18:24, Jeff Squyres a écrit : Rainer -- What do the numeric arguments refer to in the attribute format stuff? The wiki page has only one example, and it doesn't explain what these numbers are: https://svn.open-mpi.org/trac/ompi/wiki/CompilerAttributes Thanks! -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303
To bounce on last George remark, currently when a job dies without unsubscribing a port with Unpublish(due to poor user programming, failure or abort), ompi-server keeps the reference forever and a new application can therefore not publish under the same name again. So I guess this is a good point to cleanup correctly all published/opened ports, when the application is ended (for whatever reason). Another cool feature could be to have mpirun behave as an ompi-server, and publish a suitable URI if requested to do so (if the urifile does not exist yet ?). I know from the source code that mpirun is already including anything needed to offer this feature, exept the ability to provide a suitable URI. Aurelien Le 25 avr. 08 à 19:19, George Bosilca a écrit : Ralph, Thanks for your concern regarding the level of compliance of our implementation of the MPI standard. I don't know who were the MPI gurus you talked with about this issue, but I can tell that for once the MPI standard is pretty clear about this. As stated by Aurelien in his last email, using the plural in several sentences, strongly suggest that the status of port should not be implicitly modified by MPI_Comm_accept or MPI_Comm_connect. Moreover, in the beginning of the chapter in the MPI standard, it is specified that comm/accept work exactly as in TCP. In other words, once the port is opened it stay open until the user explicitly close it. However, not all corner cases are addressed by the MPI standard. What happens on MPI_Finalize ... it's a good question. Personally, I think we should stick with the TCP similarities. The port should be not only closed by unpublished. This will solve all issues with people trying to lookup a port once the originator is gone. george. On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote: As I said, it makes no difference to me. I just want to ensure that everyone agrees on the interpretation of the MPI standard. We have had these discussion in the past, with differing views. My guess here is that the port was left open mostly because the person who wrote the C-binding forgot to close it. ;-) So, you MPI folks: do we allow multiple connections against a single port, and leave the port open until explicitly closed? If so, then do we generate an error if someone calls MPI_Finalize without first closing the port? Or do we automatically close any open ports when finalize is called? Or do we automatically close the port after the connect/accept is completed? Thanks Ralph On 4/25/08 3:13 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: Actually, the port was still left open forever before the change. The bug damaged the port string, and it was not usable anymore, not only in subsequent Comm_accept, but also in Close_port or Unpublish_name. To more specifically answer to your open port concern, if the user does not want to have an open port anymore, he should specifically call MPI_Close_port and not rely on MPI_Comm_accept to close it. Actually the standard suggests the exact contrary: section 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". Because there is multiple clients AND multiple connections in that sentence, I assume the port can be used in multiple accepts. Aurelien Le 25 avr. 08 à 16:53, Ralph Castain a écrit : Hmmm...just to clarify, this wasn't a "bug". It was my understanding per the MPI folks that a separate, unique port had to be created for every invocation of Comm_accept. They didn't want a port hanging around open, and their plan was to close the port immediately after the connection was established. So dpm_orte was written to that specification. When I reorganized the code, I left the logic as it had been written - which was actually done by the MPI side of the house, not me. I have no problem with making the change. However, since the specification was created on the MPI side, I just want to make sure that the MPI folks all realize this has now been changed. Obviously, if this change in spec is adopted, someone needs to make sure that the C and Fortran bindings - do not- close that port any more! Ralph On 4/25/08 2:41 PM, "boute...@osl.iu.edu" <boute...@osl.iu.edu> wrote: Author: bouteill Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008) New Revision: 18303 URL: https://svn.open-mpi.org/trac/ompi/changeset/18303 Log: Fix a bug that rpevented to use the same port (as returned by Open_port) for several Comm_accept) Text files modified: trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++- 1 files changed, 10 insertions(+), 9 deletions(-) Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c = = = = = = = = = = = = == --- trunk/ompi/mca/dpm/orte/dpm_orte.c (original) +++ tr
Re: [OMPI devel] MPI_Comm_connect/Accept
Still no luck here, I launch those three processes : term1$ ompi-server -d --report-uri URIFILE term2$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_accept term3$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1 simple_connect The output of ompi-server shows a successful publish and lookup. I get the correct port on the client side. However, the result is the same as when not using the Publish/Lookup mechanism: the connect fails saying the port cannot be reached. Found port < 1940389889.0;tcp:// 160.36.252.99:49777;tcp6://2002:a024:ed65:9:21b:63ff:fecb: 28:49778;tcp6://fec0::9:21b:63ff:fecb:28:49778;tcp6://2002:a024:ff7f: 9:21b:63ff:fecb:28:49778:300 > [abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../trunk/orte/mca/rml/oob/rml_oob_send.c at line 140 [abouteil.nomad.utk.edu:60339] [[29620,1],0] attempted to send to [[29608,1],0] [abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../trunk/ompi/mca/dpm/orte/dpm_orte.c at line 455 [abouteil.nomad.utk.edu:60339] *** An error occurred in MPI_Comm_connect [abouteil.nomad.utk.edu:60339] *** on communicator MPI_COMM_SELF [abouteil.nomad.utk.edu:60339] *** MPI_ERR_UNKNOWN: unknown error [abouteil.nomad.utk.edu:60339] *** MPI_ERRORS_ARE_FATAL (goodbye) I took a look in the source code, and I think the problem comes from a conceptional mistake in MPI_Connect. The function "connect_accept" in dpm_orte.c takes a orte_process_name_t as the destination port. This structure only contains the jobid and the vpid (always set to 0, I guess meaning you plan to contact the HNP of that job). Obviously, if the accepting process does not share the same HNP with the connecting process, there is no way for the MPI_Comm_connect function to fill correctly this field. The all purpose of the port_name string is to provide a consistent way to access the remote endpoint without a complicated name resolution service. I think this function should take the port_name instead (the string returned by open_port) and contact directly with OOB this endpoint to get the contact informations it needs from there, and not from the local HNP. Aurelien Le 4 avr. 08 à 15:21, Ralph H Castain a écrit : Okay, I have a partial fix in there now. You'll have to use -mca routed unity as I still need to fix it for routed tree. Couple of things: 1. I fixed the --debug flag so it automatically turns on the debug output from the data server code itself. Now ompi-server will tell you when it is accessed. 2. remember, we added an MPI_Info key that specifies if you want the data stored locally (on your own mpirun) or globally (on the ompi- server). If you specify nothing, there is a precedence built into the code that defaults to "local". So you have to tell us that this data is to be published "global" if you want to connect multiple mpiruns. I believe Jeff wrote all that up somewhere - could be in an email thread, though. Been too long ago for me to remember... ;-) You can look it up in the code though as a last resort - it is in ompi/mca/pubsub/orte/pubsub_orte.c. Ralph On 4/4/08 12:55 PM, "Ralph H Castain" <r...@lanl.gov> wrote: Well, something got borked in here - will have to fix it, so this will probably not get done until next week. On 4/4/08 12:26 PM, "Ralph H Castain" <r...@lanl.gov> wrote: Yeah, you didn't specify the file correctly...plus I found a bug in the code when I looked (out-of-date a little in orterun). I am updating orterun (commit soon) and will include a better help message about the proper format of the orterun cmd-line option. The syntax is: -ompi-server uri or -ompi-server file:filename-where-uri-exists Problem here is that you gave it a uri of "test", which means nothing. ;-) Should have it up-and-going soon. Ralph On 4/4/08 12:02 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: Ralph, I've not been very successful at using ompi-server. I tried this : xterm1$ ompi-server --debug-devel -d --report-uri test [grosse-pomme.local:01097] proc_info: hnp_uri NULL daemon uri NULL [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running! xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test Port name: 2285895681.0;tcp://192.168.0.101:50065;tcp:// 192.168.0.150:50065:300 xterm3$ mpirun -ompi-server test -np 1 simple_connect -- Process rank 0 attempted to lookup from a global ompi_server that could not be contacted. This is typically caused by either not specifying the contact info for the server, or by the server not currently ex
Re: [OMPI devel] MPI_Comm_connect/Accept
Ralph, I've not been very successful at using ompi-server. I tried this : xterm1$ ompi-server --debug-devel -d --report-uri test [grosse-pomme.local:01097] proc_info: hnp_uri NULL daemon uri NULL [grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running! xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test Port name: 2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300 xterm3$ mpirun -ompi-server test -np 1 simple_connect -- Process rank 0 attempted to lookup from a global ompi_server that could not be contacted. This is typically caused by either not specifying the contact info for the server, or by the server not currently executing. If you did specify the contact info for a server, please check to see that the server is running and start it again (or have your sys admin start it) if it isn't. -- [grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name [grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD [grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument [grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye) -- The server code Open_port, and then PublishName. Looks like the LookupName function cannot reach the ompi-server. The ompi-server in debug mode does not show any output when a new event occurs (like when the server is launched). Is there something wrong in the way I use it ? Aurelien Le 3 avr. 08 à 17:21, Ralph Castain a écrit : Take a gander at ompi/tools/ompi-server - I believe I put a man page in there. You might just try "man ompi-server" and see if it shows up. Holler if you have a question - not sure I documented it very thoroughly at the time. On 4/3/08 3:10 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: Ralph, I am using trunk. Is there a documentation for ompi-server ? Sounds exactly like what I need to fix point 1. Aurelien Le 3 avr. 08 à 17:06, Ralph Castain a écrit : I guess I'll have to ask the basic question: what version are you using? If you are talking about the trunk, there no longer is a "universe" concept anywhere in the code. Two mpiruns can connect/accept to each other as long as they can make contact. To facilitate that, we created an "ompi- server" tool that is supposed to be run by the sys-admin (or a user, doesn't matter which) on the head node - there are various ways to tell mpirun how to contact the server, or it can self-discover it. I have tested publish/lookup pretty thoroughly and it seems to work. I haven't spent much time testing connect/accept except via comm_spawn, which seems to be working. Since that uses the same mechanism, I would have expected connect/accept to work as well. If you are talking about 1.2.x, then the story is totally different. Ralph On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: Hi everyone, I'm trying to figure out how complete is the implementation of Comm_connect/Accept. I found two problematic cases. 1) Two different programs are started in two different mpirun. One makes accept, the second one use connect. I would not expect MPI_Publish_name/Lookup_name to work because they do not share the HNP. Still I would expect to be able to connect by copying (with printf-scanf) the port_name string generated by Open_port; especially considering that in Open MPI, the port_name is a string containing the tcp address and port of the rank 0 in the server communicator. However, doing so results in "no route to host" and the connecting application aborts. Is the problem related to an explicit check of the universes on the accept HNP ? Do I expect too much from the MPI standard ? Is it because my two applications does not share the same universe ? Should we (re) add the ability to use the same universe for several mpirun ? 2) Second issue is when the program setup a port, and then accept multiple clients on this port. Everything works fine for the first client, and then accept stalls forever when waiting for the second one. My understanding of the standard is that it should work: 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". I understand that for one MPI_Open_port I should be able to manage several MPI clients. Am I understanding correctly the standard here and should we fix this ? Here is a copy of the non-working code for reference. /* * Copyright (c) 2004-2007 The Trustees of the University of Tennessee. * All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow * * $HEADER$ */ #include #include #include int main(int argc, char
Re: [OMPI devel] MPI_Comm_connect/Accept
Ralph, I am using trunk. Is there a documentation for ompi-server ? Sounds exactly like what I need to fix point 1. Aurelien Le 3 avr. 08 à 17:06, Ralph Castain a écrit : I guess I'll have to ask the basic question: what version are you using? If you are talking about the trunk, there no longer is a "universe" concept anywhere in the code. Two mpiruns can connect/accept to each other as long as they can make contact. To facilitate that, we created an "ompi- server" tool that is supposed to be run by the sys-admin (or a user, doesn't matter which) on the head node - there are various ways to tell mpirun how to contact the server, or it can self-discover it. I have tested publish/lookup pretty thoroughly and it seems to work. I haven't spent much time testing connect/accept except via comm_spawn, which seems to be working. Since that uses the same mechanism, I would have expected connect/accept to work as well. If you are talking about 1.2.x, then the story is totally different. Ralph On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: Hi everyone, I'm trying to figure out how complete is the implementation of Comm_connect/Accept. I found two problematic cases. 1) Two different programs are started in two different mpirun. One makes accept, the second one use connect. I would not expect MPI_Publish_name/Lookup_name to work because they do not share the HNP. Still I would expect to be able to connect by copying (with printf-scanf) the port_name string generated by Open_port; especially considering that in Open MPI, the port_name is a string containing the tcp address and port of the rank 0 in the server communicator. However, doing so results in "no route to host" and the connecting application aborts. Is the problem related to an explicit check of the universes on the accept HNP ? Do I expect too much from the MPI standard ? Is it because my two applications does not share the same universe ? Should we (re) add the ability to use the same universe for several mpirun ? 2) Second issue is when the program setup a port, and then accept multiple clients on this port. Everything works fine for the first client, and then accept stalls forever when waiting for the second one. My understanding of the standard is that it should work: 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". I understand that for one MPI_Open_port I should be able to manage several MPI clients. Am I understanding correctly the standard here and should we fix this ? Here is a copy of the non-working code for reference. /* * Copyright (c) 2004-2007 The Trustees of the University of Tennessee. * All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow * * $HEADER$ */ #include #include #include int main(int argc, char *argv[]) { char port[MPI_MAX_PORT_NAME]; int rank; int np; MPI_Init(, ); MPI_Comm_rank(MPI_COMM_WORLD, ); MPI_Comm_size(MPI_COMM_WORLD, ); if(rank) { MPI_Comm comm; /* client */ MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); printf("Read port: %s\n", port); MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, ); MPI_Send(, 1, MPI_INT, 0, 1, comm); MPI_Comm_disconnect(); } else { int nc = np - 1; MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm)); MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request)); int *event = (int *) calloc(nc, sizeof(int)); int i; MPI_Open_port(MPI_INFO_NULL, port); /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/ printf("Port name: %s\n", port); for(i = 1; i < np; i++) MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0, MPI_COMM_WORLD); for(i = 0; i < nc; i++) { MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, _nodes[i]); printf("Accept %d\n", i); MPI_Irecv([i], 1, MPI_INT, 0, 1, comm_nodes[i], [i]); printf("IRecv %d\n", i); } MPI_Close_port(port); MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE); for(i = 0; i < nc; i++) { printf("event[%d] = %d\n", i, event[i]); MPI_Comm_disconnect(_nodes[i]); printf("Disconnect %d\n", i); } } MPI_Finalize(); return EXIT_SUCCESS; } -- * Dr. Aurélien Bouteiller * Sr. Research Associate at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321 ___ devel mailing list de...@open-mpi.o
[OMPI devel] MPI_Comm_connect/Accept
Hi everyone, I'm trying to figure out how complete is the implementation of Comm_connect/Accept. I found two problematic cases. 1) Two different programs are started in two different mpirun. One makes accept, the second one use connect. I would not expect MPI_Publish_name/Lookup_name to work because they do not share the HNP. Still I would expect to be able to connect by copying (with printf-scanf) the port_name string generated by Open_port; especially considering that in Open MPI, the port_name is a string containing the tcp address and port of the rank 0 in the server communicator. However, doing so results in "no route to host" and the connecting application aborts. Is the problem related to an explicit check of the universes on the accept HNP ? Do I expect too much from the MPI standard ? Is it because my two applications does not share the same universe ? Should we (re) add the ability to use the same universe for several mpirun ? 2) Second issue is when the program setup a port, and then accept multiple clients on this port. Everything works fine for the first client, and then accept stalls forever when waiting for the second one. My understanding of the standard is that it should work: 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". I understand that for one MPI_Open_port I should be able to manage several MPI clients. Am I understanding correctly the standard here and should we fix this ? Here is a copy of the non-working code for reference. /* * Copyright (c) 2004-2007 The Trustees of the University of Tennessee. * All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow * * $HEADER$ */ #include #include #include int main(int argc, char *argv[]) { char port[MPI_MAX_PORT_NAME]; int rank; int np; MPI_Init(, ); MPI_Comm_rank(MPI_COMM_WORLD, ); MPI_Comm_size(MPI_COMM_WORLD, ); if(rank) { MPI_Comm comm; /* client */ MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); printf("Read port: %s\n", port); MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, ); MPI_Send(, 1, MPI_INT, 0, 1, comm); MPI_Comm_disconnect(); } else { int nc = np - 1; MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm)); MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request)); int *event = (int *) calloc(nc, sizeof(int)); int i; MPI_Open_port(MPI_INFO_NULL, port); /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/ printf("Port name: %s\n", port); for(i = 1; i < np; i++) MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0, MPI_COMM_WORLD); for(i = 0; i < nc; i++) { MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, _nodes[i]); printf("Accept %d\n", i); MPI_Irecv([i], 1, MPI_INT, 0, 1, comm_nodes[i], [i]); printf("IRecv %d\n", i); } MPI_Close_port(port); MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE); for(i = 0; i < nc; i++) { printf("event[%d] = %d\n", i, event[i]); MPI_Comm_disconnect(_nodes[i]); printf("Disconnect %d\n", i); } } MPI_Finalize(); return EXIT_SUCCESS; } -- * Dr. Aurélien Bouteiller * Sr. Research Associate at Innovative Computing Laboratory * University of Tennessee * 1122 Volunteer Boulevard, suite 350 * Knoxville, TN 37996 * 865 974 6321
Re: [OMPI devel] Fault tolerance
We now use the errmgr. Aurelien Le 6 mars 08 à 13:38, Aurélien Bouteiller a écrit : Aside of what Josh said, we are working right know at UTK on orted/MPI recovery (without killing/respawning all). For now we had no use of the errgmr, but I'm quite sure this would be the smartest place to put all the mechanisms we are trying now. Aurelien Le 6 mars 08 à 11:17, Ralph Castain a écrit : Ah - ok, thanks for clarifying! I'm happy to leave it around, but wasn't sure if/where it fit into anyone's future plans. Thanks Ralph On 3/6/08 9:13 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote: The checkpoint/restart work that I have integrated does not respond to failure at the moment. If a failures happens I want ORTE to terminate the entire job. I will then restart the entire job from a checkpoint file. This follows the 'all fall down' approach that users typically expect when using a global C/R technique. Eventually I want to integrate something better where I can respond to a failure with a recovery from inside ORTE. I'm not there yet, but hopefully in the near future. I'll let the UTK group talk about what they are doing with ORTE, but I suspect they will be taking advantage of the errmgr to help respond to failure and restart a single process. It is important to consider in this context that we do *not* always want ORTE to abort whenever it detects a process failure. This is the default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should be supported. But there is another mode in which we would like ORTE to keep running to conform with (MPI_ERRORS_RETURN): http://www.mpi-forum.org/docs/mpi-11-html/node148.html It is known that certain standards conformant MPI "fault tolerant" programs do not work in Open MPI for various reasons some in the runtime and some external. Here we are mostly talking about disconnected fates of intra-communicator groups. I have a test in the ompi-tests repository that illustrates this problem, but I do not have time to fix it at the moment. So in short keep the errmgr around for now. I suspect we will be using it, and possibly tweaking it in the nearish future. Thanks for the observation. Cheers, Josh On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote: Hello I've been doing some work on fault response within the system, and finally realized something I should probably have seen awhile back. Perhaps I am misunderstanding somewhere, so forgive the ignorance if so. When we designed ORTE some time in the deep, dark past, we had envisioned that people might want multiple ways of responding to process faults and/or abnormal terminations. You might want to just abort the job, attempt to restart just that proc, attempt to restart the job, etc. To support these multiple options, and to provide a means for people to simply try new ones, we created the errmgr framework. Our thought was that a process and/or daemon would call the errmgr when we detected something abnormal happening, and that the selected errmgr component could then do whatever fault response was desired. However, I now see that the fault tolerance mechanisms inside of OMPI do not seem to be using that methodology. Instead, we have hard-coded a particular response into the system. If we configure without FT, we just abort the entire job since that is the only errmgr component that exists. If we configure with FT, then we execute the hard-coded C/R methodology. This is built directly into the code, so there is no option as to what happens. Is there a reason why the errmgr framework was not used? Did the FT team decide that this was not a useful tool to support multiple FT strategies? Can we modify it to better serve those needs, or is it simply not feasible? If it isn't going to be used for that purpose, then I might as well remove it. As things stand, there really is no purpose served by the errmgr framework - might as well replace it with just a function call. Appreciate any insights Ralph ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] OMPI and Mac Leopard
Trunk works fine in Leopard in both static and dso build. Didn't tried the tmp branch on Leopard tough. Aurelien Le 22 févr. 08 à 23:17, Ralph Castain a écrit : I have confirmed that my tmp branch now builds and works on the Mac Leopard OS, at least on an Intel arch. It is really critical, however, that you don't try to build statically on that system (trust me - hard experience). I believe the trunk and older versions are having some problems under Leopard. I haven't fully confirmed that, though I did see some strange behavior on my test machine here, so it may not be entirely accurate. I am waiting for just a couple of checks to be completed before merging the branch to the trunk. Hopefully, the appropriate people will have a chance to finish those checks over the next few days so we can do the merge next week. Will keep you posted. Ralph ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] PML V will be enabled again
Hi everyone, All the problems detected last time PML V has been enabled in trunk have been fixed. We invite you to give it a try (add a .ompi_unignore in ompi/mca/pml/v) on your favorite platform and compilation options and report any issues you may encounter. If none are detected, we plan to remove the ignore tag on wed. feb. 6. Thanks, Aurelien -- Dr. Aurélien Bouteiller Sr. Research Associate - Innovative Computing Laboratory Suite 350, 1122 Volunteer Boulevard Knoxville, TN 37996 865 974 6321
Re: [OMPI devel] orte_ns_base_select failed: returned value -1 instead of ORTE_SUCCESS
I tried using a fresh trunk, same problem have occured. Here is the complete configure line. I am using libtool 1.5.22 from fink. Otherwise everything is standard OS 10.5. $ ../trunk/configure --prefix=/Users/bouteill/ompi/build --enable- mpirun-prefix-by-default --disable-io-romio --enable-debug --enable- picky --enable-mem-debug --enable-mem-profile --enable-visibility -- disable-dlopen --disable-shared --enable-static The error message generated by abort contains garbage (line numbers do not match anything in .c files and according to gdb the failure does not occur during ns initialization). This looks like a heap corruption or something as bad. orterun (argc=4, argv=0xb81c) at ../../../../trunk/orte/tools/ orterun/orterun.c:529 529 cb_states = ORTE_PROC_STATE_TERMINATED | ORTE_PROC_STATE_AT_STG1; (gdb) n 530 rc = orte_rmgr.spawn_job(apps, num_apps, , 0, NULL, job_state_callback, cb_states, ); (gdb) n 531 while (NULL != (item = opal_list_remove_first())) OBJ_RELEASE(item); (gdb) n ** Stepping over inlined function code. ** 532 OBJ_DESTRUCT(); (gdb) n 534 if (orterun_globals.do_not_launch) { (gdb) n 539 OPAL_THREAD_LOCK(_globals.lock); (gdb) n 541 if (ORTE_SUCCESS == rc) { (gdb) n 542 while (!orterun_globals.exit) { (gdb) n 543 opal_condition_wait(_globals.cond, (gdb) n [grosse-pomme.local:77335] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/ oob_base_init.c at line 74 Aurelien Le 30 janv. 08 à 17:18, Ralph Castain a écrit : Are you running on the trunk, or an earlier release? If the trunk, then I suspect you have a stale library hanging around. I build and run statically on Leopard regularly. On 1/30/08 2:54 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote: I get a runtime error in static build on Mac OS 10.5 (automake 1.10, autoconf 2.60, gcc-apple-darwin 4.01, libtool 1.5.22). The error does not occur in dso builds, and everything seems to work fine on Linux. Here is the error log. ~/ompi$ mpirun -np 2 NetPIPE_3.6/NPmpi [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/ oob_base_init.c at line 74 [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ns/proxy/ ns_proxy_component.c at line 222 [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Error in file / SourceCache/openmpi/openmpi-5/openmpi/orte/runtime/orte_init_stage1.c at line 230 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ns_base_select failed --> Returned value -1 instead of ORTE_SUCCESS -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init_stage1 failed --> Returned "Error" (-1) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (goodbye) -- Dr. Aurélien Bouteiller Sr. Research Associate - Innovative Computing Laboratory Suite 350, 1122 Volunteer Boulevard Knoxville, TN 37996 865 974 6321 ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] orte_ns_base_select failed: returned value -1 instead of ORTE_SUCCESS
I get a runtime error in static build on Mac OS 10.5 (automake 1.10, autoconf 2.60, gcc-apple-darwin 4.01, libtool 1.5.22). The error does not occur in dso builds, and everything seems to work fine on Linux. Here is the error log. ~/ompi$ mpirun -np 2 NetPIPE_3.6/NPmpi [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/ oob_base_init.c at line 74 [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ns/proxy/ ns_proxy_component.c at line 222 [grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Error in file / SourceCache/openmpi/openmpi-5/openmpi/orte/runtime/orte_init_stage1.c at line 230 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ns_base_select failed --> Returned value -1 instead of ORTE_SUCCESS -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_mpi_init: orte_init_stage1 failed --> Returned "Error" (-1) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (goodbye) -- Dr. Aurélien Bouteiller Sr. Research Associate - Innovative Computing Laboratory Suite 350, 1122 Volunteer Boulevard Knoxville, TN 37996 865 974 6321
Re: [OMPI devel] RES: v pml question
I just agree with Josh. We though about it a bit, and nothing should prevent to use both. Aurelien Le 29 janv. 08 à 15:01, Josh Hursey a écrit : At the moment I do not plan on joining the crcpw and v_protocol. However those two components may currently work just fine together. They are both designed to wrap around whatever the 'selected' PML happens to be. If you tried to do this, I would expect the PML call stack to look something like the following: PML_SEND -> v_protocol -> crcpw -> ob1/cm But since I have not tried this out I cannot say for sure. Let us know if you have any problems. Cheers, Josh On Jan 23, 2008, at 4:55 PM, Leonardo Fialho wrote: I'm testing the v protocol just now. Anybody have plans to do a message wrapper mixing crcpw and v_protocol? Leonardo Fialho University Autonoma of Barcelona ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Trunk borked
DSO build also fail. ../../../../../../trunk/ompi/contrib/vt/vt/vtlib/vt_comp_gnu.c:312:5: warning: "VT_BFD" is not defined ../../../../../../trunk/ompi/contrib/vt/vt/vtlib/vt_comp_gnu.c:312:5: warning: "VT_BFD" is not defined /usr/bin/ld: cannot find -lz collect2: ld returned 1 exit status make[6]: *** [vtfilter] Error 1 Le 29 janv. 08 à 01:51, George Bosilca a écrit : Look like VT do not correctly compute dependencies. A static build will fails if libz.a is not installed on the system. /usr/bin/ld: cannot find -lz collect2: ld returned 1 exit status make[5]: *** [vtfilter] Error 1 george. On Jan 28, 2008, at 12:37 PM, Matthias Jurenz wrote: Hello, this problem should be fixed now... It seems that the symbol '__pos' is not available on every platform. This isn't a problem, because it's only used for a debug control message. Regards, Matthias On Mo, 2008-01-28 at 09:41 -0500, Jeff Squyres wrote: Doh - this is Solaris on x86? I think Terry said Solaris/sparc was tested... VT guys -- can you check out what's going on? On Jan 28, 2008, at 9:36 AM, Adrian Knoth wrote: > On Mon, Jan 28, 2008 at 07:26:56AM -0700, Ralph H Castain wrote: > >> We seem to have a problem on the trunk this morning. I am building >> on a > > There are more errors: > > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function > `fsetpos': > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:850: error: request > for member `__pos' in something not a structure or union > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function > `fsetpos64': > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:876: error: request > for member `__pos' in something not a structure or union > gmake[5]: *** [vt_iowrap.o] Error 1 > gmake[5]: Leaving directory > `/tmp/ompi/build/SunOS-i86pc/ompi/ompi/contrib/vt/vt/vtlib' > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function > `fsetpos': > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:850: error: request > for member `__pos' in something not a structure or union > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function > `fsetpos64': > /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:876: error: request > for member `__pos' in something not a structure or union > gmake[5]: *** [vt_iowrap.o] Error 1 > gmake[5]: Leaving directory > `/tmp/ompi/build/SunOS-i86pc/ompi/ompi/contrib/vt/vt/vtlib' > > > Just my $0.02 > > -- > Cluster and Metacomputing Working Group > Friedrich-Schiller-Universität Jena, Germany > > private: http://adi.thur.de > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Matthias Jurenz, Center for Information Services and High Performance Computing (ZIH), TU Dresden, Willersbau A106, Zellescher Weg 12, 01062 Dresden phone +49-351-463-31945, fax +49-351-463-37773 ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Fwd: === CREATE FAILURE ===
According to posix, tar should not limit the file name length. Only the v7 implementation of tar is limited to 99 characters. GNU tar has never been limited in the number of characters file names can have. You should check with tar --help that tar on your machine defaults to format=gnu or format=posix. If it defaults to format=v7 I am curious why. Are you using solaris ? Aurelien Le 24 janv. 08 à 15:18, Jeff Squyres a écrit : I'm trying to replicate and getting a lot of these: tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/ pessimist/vprotocol_pessimist_sender_based.c: file name is too long (max 99); not dumped tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/ pessimist/vprotocol_pessimist_component.c: file name is too long (max 99); not dumped I'll bet that this is the real problem. GNU tar on linux defaults to 99 characters max, and the _component.c filename is 102, for example. Can you shorten your names? On Jan 24, 2008, at 3:02 PM, George Bosilca wrote: We cannot reproduce this one. A simple "make checkdist" exit long before doing anything in the ompi directory. It is difficult to see where exactly it fails, but it is somewhere in the opal directory. I suspect the new carto framework ... Thanks, george. On Jan 24, 2008, at 7:12 AM, Jeff Squyres wrote: Aurelien -- Can you fix please? Last night's tests didn't run because of this failure. Begin forwarded message: From: MPI TeamDate: January 23, 2008 9:13:30 PM EST To: test...@open-mpi.org Subject: === CREATE FAILURE === Reply-To: de...@open-mpi.org ERROR: Command returned a non-zero exist status make -j 4 distcheck Start time: Wed Jan 23 21:00:08 EST 2008 End time: Wed Jan 23 21:13:30 EST 2008 = = = = === [... previous lines snipped ...] config.status: creating orte/mca/snapc/Makefile config.status: creating orte/mca/snapc/full/Makefile config.status: creating ompi/mca/allocator/Makefile config.status: creating ompi/mca/allocator/basic/Makefile config.status: creating ompi/mca/allocator/bucket/Makefile config.status: creating ompi/mca/bml/Makefile config.status: creating ompi/mca/bml/r2/Makefile config.status: creating ompi/mca/btl/Makefile config.status: creating ompi/mca/btl/gm/Makefile config.status: creating ompi/mca/btl/mx/Makefile config.status: creating ompi/mca/btl/ofud/Makefile config.status: creating ompi/mca/btl/openib/Makefile config.status: creating ompi/mca/btl/portals/Makefile config.status: creating ompi/mca/btl/sctp/Makefile config.status: creating ompi/mca/btl/self/Makefile config.status: creating ompi/mca/btl/sm/Makefile config.status: creating ompi/mca/btl/tcp/Makefile config.status: creating ompi/mca/btl/udapl/Makefile config.status: creating ompi/mca/coll/Makefile config.status: creating ompi/mca/coll/basic/Makefile config.status: creating ompi/mca/coll/inter/Makefile config.status: creating ompi/mca/coll/self/Makefile config.status: creating ompi/mca/coll/sm/Makefile config.status: creating ompi/mca/coll/tuned/Makefile config.status: creating ompi/mca/common/Makefile config.status: creating ompi/mca/common/mx/Makefile config.status: creating ompi/mca/common/portals/Makefile config.status: creating ompi/mca/common/sm/Makefile config.status: creating ompi/mca/crcp/Makefile config.status: creating ompi/mca/crcp/coord/Makefile config.status: creating ompi/mca/io/Makefile config.status: creating ompi/mca/io/romio/Makefile config.status: creating ompi/mca/mpool/Makefile config.status: creating ompi/mca/mpool/rdma/Makefile config.status: creating ompi/mca/mpool/sm/Makefile config.status: creating ompi/mca/mtl/Makefile config.status: creating ompi/mca/mtl/mx/Makefile config.status: creating ompi/mca/mtl/portals/Makefile config.status: creating ompi/mca/mtl/psm/Makefile config.status: creating ompi/mca/osc/Makefile config.status: creating ompi/mca/osc/pt2pt/Makefile config.status: creating ompi/mca/osc/rdma/Makefile config.status: creating ompi/mca/pml/Makefile config.status: creating ompi/mca/pml/cm/Makefile config.status: creating ompi/mca/pml/crcpw/Makefile config.status: creating ompi/mca/pml/dr/Makefile config.status: creating ompi/mca/pml/ob1/Makefile config.status: creating ompi/mca/pml/v/vprotocol/Makefile config.status: error: cannot find input file: ompi/mca/pml/v/ vprotocol/pessimist/Makefile.in make: *** [distcheck] Error 1 = = = = === Your friendly daemon, Cyrador ___ testing mailing list test...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/testing -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org
Re: [OMPI devel] RES: v pml question
Hi, Actually it might already work. We never tried yet but nothing should prevent it. The symlinks are necessary to trick the autogen and configure stages. This is required to avoid code replication from autogen.sh. If you look carefully you will see that the simlinks are created only inside the build directory, and not in the source directory. Thus it does not help to add them to the trunk. Aurelien Le 23 janv. 08 à 16:55, Leonardo Fialho a écrit : I'm testing the v protocol just now. Anybody have plans to do a message wrapper mixing crcpw and v_protocol? Leonardo Fialho University Autonoma of Barcelona -Mensagem original- De: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] Em nome de Jeff Squyres Enviada em: miércoles, 23 de enero de 2008 22:45 Para: Open Developers Assunto: [OMPI devel] v pml question Just curious: what are the "mca" and "vprotocol" symlinks to "." for in the v/vprotocol directory for? If they're necessary, can they be committed to svn? If they're not necessary, can they be removed? -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] trunk breakage
Should be fixed with r17184. Thanks for the quick bug report ! Aurelien Le 23 janv. 08 à 14:08, Jeff Squyres a écrit : The vprotocol pml does not compile for me. make[4]: Entering directory `/home/jsquyres/svn/ompi2/ompi/mca/pml/v/ vprotocol/pessimist' /bin/sh ../../../../../../libtool --tag=CC --mode=compile gcc - DHAVE_CONFIG_H -I. -I../../../../../../opal/include - I../../../../../../orte/include -I../../../../../../ompi/include - I../../../../../../opal/mca/paffinity/linux/plpa/src/libplpa - I../../../../../..-g -Wall -Wundef -Wno-long-long -Wsign-compare - Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror- implicit-function-declaration -finline-functions -fno-strict- aliasing - pthread -MT mca_vprotocol_pessimist_la- vprotocol_pessimist_sender_based.lo -MD -MP -MF .deps/ mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.Tpo -c -o mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.lo `test - f 'vprotocol_pessimist_sender_based.c' || echo './'`vprotocol_pessimist_sender_based.c libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../../../opal/ include -I../../../../../../orte/include -I../../../../../../ompi/ include -I../../../../../../opal/mca/paffinity/linux/plpa/src/ libplpa - I../../../../../.. -g -Wall -Wundef -Wno-long-long -Wsign-compare - Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror- implicit-function-declaration -finline-functions -fno-strict- aliasing - pthread -MT mca_vprotocol_pessimist_la- vprotocol_pessimist_sender_based.lo -MD -MP -MF .deps/ mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.Tpo -c vprotocol_pessimist_sender_based.c -fPIC -DPIC -o .libs/ mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.o vprotocol_pessimist_sender_based.c: In function `sb_mmap_alloc': vprotocol_pessimist_sender_based.c:94: error: `MAP_NOCACHE' undeclared (first use in this function) vprotocol_pessimist_sender_based.c:94: error: (Each undeclared identifier is reported only once vprotocol_pessimist_sender_based.c:94: error: for each function it appears in.) make[4]: *** [mca_vprotocol_pessimist_la- vprotocol_pessimist_sender_based.lo] Error 1 make[4]: Leaving directory `/home/jsquyres/svn/ompi2/ompi/mca/pml/v/ vprotocol/pessimist' make[3]: *** [all-recursive] Error 1 On Jan 23, 2008, at 12:27 PM, boute...@osl.iu.edu wrote: Author: bouteill Date: 2008-01-23 12:27:23 EST (Wed, 23 Jan 2008) New Revision: 17182 URL: https://svn.open-mpi.org/trac/ompi/changeset/17182 Log: removed ignore, as the code is robust enough to avoid interfering with others Removed: trunk/ompi/mca/pml/v/.ompi_ignore trunk/ompi/mca/pml/v/.ompi_unignore Deleted: trunk/ompi/mca/pml/v/.ompi_ignore = = = = = = = = = = Deleted: trunk/ompi/mca/pml/v/.ompi_unignore = = = = = = = = = = --- trunk/ompi/mca/pml/v/.ompi_unignore 2008-01-23 12:27:23 EST (Wed, 23 Jan 2008) +++ (empty file) @@ -1 +0,0 @@ -bouteill ___ svn-full mailing list svn-f...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn-full -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17177
Undefined symbols: "_opal_carto_base_components_opened", referenced from: _opal_carto_base_components_opened$non_lazy_ptr in components.o "_opal_carto_base_open", referenced from: ompi_info::open_components() in components.o "_opal_carto_base_close", referenced from: ompi_info::close_components() in components.o ld: symbol(s) not found collect2: ld returned 1 exit status make[3]: *** [ompi_info] Error 1 I think you forgot one file in Makefile.am ;) Aurelien Le 23 janv. 08 à 04:20, shar...@osl.iu.edu a écrit : Author: sharonm Date: 2008-01-23 04:20:34 EST (Wed, 23 Jan 2008) New Revision: 17177 URL: https://svn.open-mpi.org/trac/ompi/changeset/17177 Log: Move the carto framework to the trunk. Added: trunk/opal/class/opal_graph.c (contents, props changed) trunk/opal/class/opal_graph.h (contents, props changed) trunk/opal/mca/carto/ trunk/opal/mca/carto/Makefile.am (contents, props changed) trunk/opal/mca/carto/auto_detect/ trunk/opal/mca/carto/auto_detect/Makefile.am trunk/opal/mca/carto/auto_detect/carto_auto_detect.h trunk/opal/mca/carto/auto_detect/carto_auto_detect_component.c trunk/opal/mca/carto/auto_detect/carto_auto_detect_module.c trunk/opal/mca/carto/auto_detect/configure.params trunk/opal/mca/carto/base/ trunk/opal/mca/carto/base/Makefile.am (contents, props changed) trunk/opal/mca/carto/base/base.h (contents, props changed) trunk/opal/mca/carto/base/carto_base_close.c (contents, props changed) trunk/opal/mca/carto/base/carto_base_graph.c trunk/opal/mca/carto/base/carto_base_graph.h trunk/opal/mca/carto/base/carto_base_open.c (contents, props changed) trunk/opal/mca/carto/base/carto_base_select.c (contents, props changed) trunk/opal/mca/carto/base/static-components.h (contents, props changed) trunk/opal/mca/carto/carto.h (contents, props changed) trunk/opal/mca/carto/file/ trunk/opal/mca/carto/file/Makefile.am (contents, props changed) trunk/opal/mca/carto/file/carto_file.h (contents, props changed) trunk/opal/mca/carto/file/carto_file_component.c (contents, props changed) trunk/opal/mca/carto/file/carto_file_lex.c trunk/opal/mca/carto/file/carto_file_lex.h trunk/opal/mca/carto/file/carto_file_lex.l trunk/opal/mca/carto/file/carto_file_module.c (contents, props changed) trunk/opal/mca/carto/file/configure.params (contents, props changed) trunk/opal/mca/carto/file/help-opal-carto-file.txt trunk/test/carto/ trunk/test/carto/carto-file trunk/test/carto/carto_test.c Text files modified: trunk/ompi/runtime/ompi_mpi_finalize.c | 3 +++ trunk/ompi/runtime/ompi_mpi_init.c |11 +++ trunk/ompi/tools/ompi_info/components.cc | 6 ++ trunk/ompi/tools/ompi_info/ompi_info.cc | 1 + trunk/opal/class/Makefile.am | 2 ++ trunk/orte/tools/orterun/orterun.c | 3 +++ 6 files changed, 26 insertions(+), 0 deletions(-) Diff not shown due to size (183702 bytes). To see the diff, run the following command: svn diff -r 17176:17177 --no-diff-deleted ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn