Re: [OMPI devel] Process placement

2016-05-05 Thread Aurélien Bouteiller
Ralph, 

I still observe these issues in the current master. (npernode is not respected 
either).

Also note that the display_allocation seems to be wrong (slots_inuse=0 when the 
slot is obviously in use). 

$ git show
4899c89 (HEAD -> master, origin/master, origin/HEAD) Fix a race condition when
multiple threads try to create a bml en[...] (Bouteiller, 6 hours ago)

$ bin/mpirun -np 12 -hostfile /opt/etc/ib10g.machinefile.ompi
-display-allocation -map-by node hostname

==   ALLOCATED NODES   ==
dancer00: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer01: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer02: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer03: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer04: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer05: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer06: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer07: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer08: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer09: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer10: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer11: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer12: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer13: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer14: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
dancer15: flags=0x13 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
=
dancer01
dancer00
dancer01
dancer01
dancer01
dancer00
dancer00
dancer00
dancer00
dancer00
dancer00
dancer00


--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 
<https://icl.cs.utk.edu/~bouteill/>
> Le 13 avr. 2016 à 13:38, Ralph Castain <r...@open-mpi.org> a écrit :
> 
> The —map-by node option should now be fixed on master, and PRs waiting for 
> 1.10 and 2.0
> 
> Thx!
> 
>> On Apr 12, 2016, at 6:45 PM, Ralph Castain <r...@open-mpi.org 
>> <mailto:r...@open-mpi.org>> wrote:
>> 
>> FWIW: speaking just to the —map-by node issue, Josh Ladd reported the 
>> problem on master as well yesterday. I’ll be looking into it on Wed.
>> 
>>> On Apr 12, 2016, at 5:53 PM, George Bosilca <bosi...@icl.utk.edu 
>>> <mailto:bosi...@icl.utk.edu>> wrote:
>>> 
>>> 
>>> 
>>> On Wed, Apr 13, 2016 at 1:59 AM, Gilles Gouaillardet <gil...@rist.or.jp 
>>> <mailto:gil...@rist.or.jp>> wrote:
>>> George,
>>> 
>>> about the process binding part
>>> 
>>> On 4/13/2016 7:32 AM, George Bosilca wrote:
>>> Also my processes, despite the fact that I asked for 1 per node, are not 
>>> bound to the first core. Shouldn’t we release the process binding when we 
>>> know there is a single process per node (as in the above case) ?
>>> did you expect the tasks are bound to the first *core* on each node ?
>>> 
>>> i would expect the tasks are bound to the first *socket* on each node.
>>> 
>>> In this particular instance, where it has been explicitly requested to have
>>> a single process per node, I would have expected the process to be unbound
>>> (we know there is only one per node). It is the responsibility of the
>>> application to bind itself or its threads if necessary. Why are we
>>> enforcing a particular binding policy?
>>> 
>>> Since we do not know how many (OpenMP or other) threads will be used by
>>> the application,
>>> --bind-to socket is a good policy imho. In this case (one task per node),
>>> no binding at all would mean
>>> the task can migrate from one socket to the other, and/or OpenMP threads
>>> are bound across sockets.
>>> That would trigger some NUMA effects (better bandwidth if memory is locally
>>> accessed, but worse performance
>>> if memory is allocated on only one socket).
>>> So imho, --bind-to socket is still my preferred policy, even if there is
>>> only one MPI task per node.
>>> 
>>> Open MPI is about MPI ranks/processes. I don't think it is our job to try
>>> to figure out what the user does with its own threads.
>>> 
>>> Your justification makes sense if the application only uses a single socket.
>>> It also makes sense if one starts multiple ranks per node, and the internal
>>> thre

Re: [OMPI devel] Confusion about slots

2016-03-23 Thread Aurélien Bouteiller
To add to what Ralph said, you probably do not want to use Hyper-Threads for HPC
workloads, as that generally results in very poor performance (as you noticed).
Set the number of slots to the number of real cores (not HT); that will yield
optimal results 95% of the time.
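As an illustration (a sketch only; the hostnames are hypothetical, and the counts
match the 2-socket, 16-core, 32-HT nodes described below), a hostfile for two such
nodes would look like:

node0 slots=16
node1 slots=16

With that file, "mpirun -np 32 -hostfile myhosts ./app" starts 16 processes per
node, one per real core, instead of packing 32 processes onto the hardware
threads of a single node.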

Aurélien 

--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 
<https://icl.cs.utk.edu/~bouteill/>
> Le 23 mars 2016 à 16:24, Ralph Castain <r...@open-mpi.org> a écrit :
> 
> “Slots” are an abstraction commonly used by schedulers as a way of indicating 
> how many processes are allowed to run on a given node. It has nothing to do 
> with hardware, either cores or HTs.
> 
> MPI programmers frequently like to bind a process to one or more hardware 
> assets (cores or HTs). Thus, you will see confusion in the community where 
> people mix the term “slot” with “cores” or “cpus”. This is unfortunate, as
> the terms really do mean very different things.
> 
> In OMPI, we chose to try and “help” the user by not requiring them to specify 
> detailed info in a hostfile. So if you don’t specify the number of “slots” 
> for a given node, we will sense the number of cores on that node and set the 
> slots to match that number. This best matches user expectations today.
> 
> If you do specify the number of slots, then we use that to guide the desired 
> number of processes assigned to each node. We then bind each of those 
> processes according to the user-provided guidance.
> 
> HTH
> Ralph
> 
>> On Mar 23, 2016, at 9:35 AM, Federico Reghenzani 
>> <federico1.reghenz...@mail.polimi.it 
>> <mailto:federico1.reghenz...@mail.polimi.it>> wrote:
>> 
>> Ok, I've investigated further today; it seems "--map-by hwthread" does not
>> remove the problem. However, if I specify in the hostfile "node0 slots=32",
>> it runs much slower than when specifying only "node0". In both cases I run
>> mpirun with -np 32. So I'm quite sure I didn't understand what slots are.
>> 
>> __
>> Federico Reghenzani
>> M.Eng. Student @ Politecnico di Milano
>> Computer Science and Engineering
>> 
>> 
>> 
>> 2016-03-22 18:56 GMT+01:00 Federico Reghenzani 
>> <federico1.reghenz...@mail.polimi.it 
>> <mailto:federico1.reghenz...@mail.polimi.it>>:
>> Hi guys,
>> 
>> I'm really confused about slots in resource allocation: I thought that slots
>> are the number of processes spawnable on a given node, so they should
>> correspond to the number of processing elements of the node. For example, on
>> each of my nodes I have 2 processors, 16 cores in total with hyperthreading, so
>> a total of 32 processing elements per node (i.e. 32 hw threads). However,
>> considering a single node, passing 32 slots in the hostfile and requesting
>> "-np 32" results in a roughly 20x performance degradation compared to using only
>> "-np 16". The problem disappears when specifying --map-by hwthread.
>> 
>> Investigating on the problem I found these counterintuitive things:
>> - here 
>> <https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles> is 
>> stated "slots are Open MPI's representation of how many processors are 
>> available"
>> - here <https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php#sect6> is 
>> stated "Slots indicate how many processes can potentially execute on a node. 
>> For best performance, the number of slots may be chosen to be the number of 
>> cores on the node or the number of processor sockets" 
>> - I tried to remove the slots information from the hostfile, which according to
>> this
>> <https://www.open-mpi.org/faq/?category=running#slots-without-hostfiles>
>> should be interpreted as "1", but it spawns 32 processes anyway
>> - I'm not sure what --map-by and --rank-by do 
>> 
>> In the custom RAS we are developing, what do we have to send to mpirun? The number
>> of processor sockets, the number of cores, or the number of hwthreads
>> available? How do --map-by and --rank-by affect the spawn policy?
>> 
>> 
>> Thank you!
>> 
>> 
>> OFFTOPIC: is someone going to EuroMPI 2016 in September? We will be there to 
>> present our migration technique.
>> 
>> 
>> Cheers,
>> Federico
>> 
>> __
>> Federico Reghenzani
>> M.Eng. Student @ Politecnico di Milano
>> Computer Science and Engineering
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel 
>> <http://www.open-mpi.org/mailman/listinfo.cgi/devel>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2016/03/18723.php 
>> <http://www.open-mpi.org/community/lists/devel/2016/03/18723.php>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/03/18724.php




[OMPI devel] use-mpi mpiext?

2016-02-24 Thread Aurélien Bouteiller
I am making an MPI extension against the latest master. I have a problem with the
use-mpi part of the extension:

Makefile.am contains the following:

headers = \
        mpiext_blabla_usempi.h

noinst_HEADERS = \
        $(headers)

For some reason, the build system tries to compile a .a for the usempi 
extension. My understanding is that it should use the same bindings as the 
mpifh.a extension (which builds successfully). 

make[1]: Leaving directory 
`/home/bouteill/ompi/debug.build/ompi/mpi/fortran/mpif-h'
Making install in mpi/fortran/use-mpi-ignore-tkr
make[1]: Entering directory 
`/home/bouteill/ompi/debug.build/ompi/mpi/fortran/use-mpi-ignore-tkr'
  FCLD libmpi_usempi_ignore_tkr.la
libtool: link: cannot find the library 
`../../../../ompi/mpiext/blabla/use-mpi/libmpiext_blabla_usempi.la' or 
unhandled argument 
`../../../../ompi/mpiext/blabla/use-mpi/libmpiext_blabla_usempi.la'
make[1]: *** [libmpi_usempi_ignore_tkr.la] Error 1


--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/




Re: [OMPI devel] Remote orted verbosity

2015-11-23 Thread Aurélien Bouteiller
Federico,

Just add -debug-daemons to the mpirun command options. 
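For example (a sketch; ./a.out stands for whatever application you are running),
combined with the verbosity switch you already tried, and captured to a local file:

mpirun -np 4 -debug-daemons -mca orte_debug_verbose 10 ./a.out 2>&1 | tee orteds.log

With -debug-daemons the remote orteds stay attached to mpirun and their output is
forwarded back, so redirecting mpirun's output is enough to get it into a file.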

Aurélien
--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/ 
<https://icl.cs.utk.edu/~bouteill/>
> Le 23 nov. 2015 à 08:55, Federico Reghenzani 
> <federico1.reghenz...@mail.polimi.it> a écrit :
> 
> Hi!
> 
> Is there any way to get the output of OPAL_OUTPUT_VERBOSE on remote orteds? 
> (or write it to a local file?).
> 
> We tried with --mca orte_debug_verbose but it works only for the local 
> machine (= where mpirun is executed).
> 
> 
> Cheers,
> Federico
> 
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/11/18383.php



[OMPI devel] smcuda higher exclusivity than anything else?

2015-05-20 Thread Aurélien Bouteiller
I was making basic performance measurements on our machine after installing
1.8.5, and the performance was looking bad. It turns out that the smcuda btl has a
higher exclusivity than both vader and sm, even on machines with no nvidia
adapters. Is there a strong reason why the default exclusivity is set so high?
Of course it can easily be fixed with a couple of mca options, but unsuspecting
users that “just run” will experience a 1/3 overhead across the board for shared
memory communication, according to my measurements.
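For the record, a couple of sketches of the workaround I have in mind until the
default changes (option syntax only; adjust the btl list to your fabric):

mpirun -mca btl ^smcuda -np 16 ./app        # exclude smcuda, let sm/vader win
mpirun -mca btl vader,self -np 16 ./app     # or select the shared-memory btl explicitly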


Side note: from my understanding of the smcuda component, performance should be
identical to the regular sm component (as long as no GPU
operations are required). This is not the case: there is some performance
penalty with smcuda compared to sm.

Aurelien

--
Aurélien Bouteiller ~~ https://icl.cs.utk.edu/~bouteill/




[OMPI devel] 1.8.5rc1 and OOB on Cray XC30

2015-04-16 Thread Aurélien Bouteiller
a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[aprun6-darter:16915] [[54804,0],0] TCP SHUTDOWN
[aprun6-darter:16915] mca: base: close: component tcp closed
[aprun6-darter:16915] mca: base: close: unloading component tcp



--
Aurélien Bouteiller ~ https://icl.cs.utk.edu/~bouteill/





Re: [OMPI devel] RFC: "v1.9.0" (vs. "v1.9")

2014-09-22 Thread Aurélien Bouteiller
During the phase where there is not yet a release of “next”, the README and
other documentation uses the number of the not-yet-released upcoming
version. Sometimes when these get distributed, outsiders are confused into
thinking they are using some release version, when in fact they are running a nightly build.
Reserving a particular number (like 1.9.0) for all non-release versions of a
general series could help avoid this.

--
  ~~~ Aurélien Bouteiller, Ph.D. ~~~
 ~ Research Scientist @ ICL ~
The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/




Le 22 sept. 2014 à 14:21, Ralph Castain <r...@open-mpi.org> a écrit :

> Not sure I understand - what do you mean by a "free" number??
> 
> On Sep 22, 2014, at 10:50 AM, Aurélien Bouteiller <boute...@icl.utk.edu> 
> wrote:
> 
>> Could also start at 1.9.1 instead of 1.9.0. That gives a free number for the 
>> “trunk” nightly builds. 
>> 
>> 
>> --
>> ~~~ Aurélien Bouteiller, Ph.D. ~~~
>>~ Research Scientist @ ICL ~
>> The University of Tennessee, Innovative Computing Laboratory
>> 1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
>> tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
>> https://icl.cs.utk.edu/~bouteill/
>> 
>> 
>> 
>> 
>> Le 22 sept. 2014 à 13:38, Jeff Squyres (jsquyres) <jsquy...@cisco.com> a 
>> écrit :
>> 
>>> WHAT: Change our version numbering scheme to always include all 3 numbers 
>>> -- even when the 3rd number is 0.
>>> 
>>> WHY: I think we made a mistake years ago when we designed the version 
>>> number scheme.  It's weird that we drop the last digit when it is 0.
>>> 
>>> WHERE: Trivial patch.  See below.
>>> 
>>> WHEN: Tuesday teleconf next week, 30 Sep 2014
>>> 
>>> MORE DETAIL:
>>> 
>>> Right now, per http://www.open-mpi.org/software/ompi/versions/, when the 
>>> 3rd digit of our version number is zero, we drop it in the filename and 
>>> various other outputs (e.g., ompi_info).  For example, we have:
>>> 
>>> openmpi-1.8.tar.bz2
>>> instead of openmpi-1.8.0.tar.bz2
>>> 
>>> Honestly, I think that's just a little weird.  I think I was the one who 
>>> advocated for dropping the 0 way back in the beginning, but I'm now 
>>> changing my mind.  :-)
>>> 
>>> Making this change will be immediately obvious in the filename of the trunk 
>>> nightly tarball.  It won't affect the v1.8 series (or any prior series), 
>>> because they're all well past their .0 releases.  But it will mean that the 
>>> first release in the v1.9 series will be "v1.9.0".
>>> 
>>> Finally, note that this will also apply to all version numbers shown in 
>>> ompi_info (e.g., components and projects).
>>> 
>>> Here's the diff:
>>> 
>>> Index: config/opal_get_version.m4
>>> ===
>>> --- config/opal_get_version.m4  (revision 32771)
>>> +++ config/opal_get_version.m4  (working copy)
>>> @@ -60,12 +60,7 @@
>>> p" < "$1"`
>>> [eval] "$opal_vers"
>>> 
>>> -# Only print release version if it isn't 0
>>> -if test $$2_RELEASE_VERSION -ne 0 ; then
>>> -
>>> $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION"
>>> -else
>>> -$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION"
>>> -fi
>>> +
>>> $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION"
>>>   $2_VERSION="${$2_VERSION}${$2_GREEK_VERSION}"
>>>   $2_BASE_VERSION=$$2_VERSION
>>> 
>>> Index: opal/runtime/opal_info_support.c
>>> ===
>>> --- opal/runtime/opal_info_support.c(revision 32771)
>>> +++ opal/runtime/opal_info_support.c(working copy)
>>> @@ -1099,14 +1099,8 @@
>>>   temp[BUFSIZ - 1] = '\0';
>>>   if (0 == strcmp(scope, opal_info_ver_full) ||
>>>   0 == strcmp(scope, opal_info_ver_all)) {
>>> -snprintf(temp, BUFSIZ - 1, "%d.%d", major, minor);
>>> +snprintf(temp, BUFSIZ - 1, "%d.%d.%d", major, minor, release);
>>>   str = strdup(temp)

Re: [OMPI devel] RFC: "v1.9.0" (vs. "v1.9")

2014-09-22 Thread Aurélien Bouteiller
Could also start at 1.9.1 instead of 1.9.0. That gives a free number for the 
“trunk” nightly builds. 


--
  ~~~ Aurélien Bouteiller, Ph.D. ~~~
 ~ Research Scientist @ ICL ~
The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375   fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/




Le 22 sept. 2014 à 13:38, Jeff Squyres (jsquyres) <jsquy...@cisco.com> a écrit :

> WHAT: Change our version numbering scheme to always include all 3 numbers -- 
> even when the 3rd number is 0.
> 
> WHY: I think we made a mistake years ago when we designed the version number 
> scheme.  It's weird that we drop the last digit when it is 0.
> 
> WHERE: Trivial patch.  See below.
> 
> WHEN: Tuesday teleconf next week, 30 Sep 2014
> 
> MORE DETAIL:
> 
> Right now, per http://www.open-mpi.org/software/ompi/versions/, when the 3rd 
> digit of our version number is zero, we drop it in the filename and various 
> other outputs (e.g., ompi_info).  For example, we have:
> 
>   openmpi-1.8.tar.bz2
> instead of openmpi-1.8.0.tar.bz2
> 
> Honestly, I think that's just a little weird.  I think I was the one who 
> advocated for dropping the 0 way back in the beginning, but I'm now changing 
> my mind.  :-)
> 
> Making this change will be immediately obvious in the filename of the trunk 
> nightly tarball.  It won't affect the v1.8 series (or any prior series), 
> because they're all well past their .0 releases.  But it will mean that the 
> first release in the v1.9 series will be "v1.9.0".
> 
> Finally, note that this will also apply to all version numbers shown in 
> ompi_info (e.g., components and projects).
> 
> Here's the diff:
> 
> Index: config/opal_get_version.m4
> ===
> --- config/opal_get_version.m4(revision 32771)
> +++ config/opal_get_version.m4(working copy)
> @@ -60,12 +60,7 @@
>   p" < "$1"`
>   [eval] "$opal_vers"
> 
> -# Only print release version if it isn't 0
> -if test $$2_RELEASE_VERSION -ne 0 ; then
> -
> $2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION"
> -else
> -$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION"
> -fi
> +$2_VERSION="$$2_MAJOR_VERSION.$$2_MINOR_VERSION.$$2_RELEASE_VERSION"
> $2_VERSION="${$2_VERSION}${$2_GREEK_VERSION}"
> $2_BASE_VERSION=$$2_VERSION
> 
> Index: opal/runtime/opal_info_support.c
> ===
> --- opal/runtime/opal_info_support.c  (revision 32771)
> +++ opal/runtime/opal_info_support.c  (working copy)
> @@ -1099,14 +1099,8 @@
> temp[BUFSIZ - 1] = '\0';
> if (0 == strcmp(scope, opal_info_ver_full) ||
> 0 == strcmp(scope, opal_info_ver_all)) {
> -snprintf(temp, BUFSIZ - 1, "%d.%d", major, minor);
> +snprintf(temp, BUFSIZ - 1, "%d.%d.%d", major, minor, release);
> str = strdup(temp);
> -if (release > 0) {
> -snprintf(temp, BUFSIZ - 1, ".%d", release);
> -asprintf(&tmp, "%s%s", str, temp);
> -free(str);
> -str = tmp;
> -}
> if (NULL != greek) {
> asprintf(&tmp, "%s%s", str, greek);
> free(str);
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/09/15887.php




Re: [OMPI devel] KNEM + user-space hybrid for sm BTL

2013-07-18 Thread Aurélien Bouteiller

Le 18 juil. 2013 à 11:12, "Iliev, Hristo" <il...@rz.rwth-aachen.de> a écrit :

> Hello,
>  
> Could someone, who is more familiar with the architecture of the sm BTL, 
> comment on the technical feasibility of the following: is it possible to 
> easily extend the BTL (i.e. without having to rewrite it completely from 
> scratch) so as to be able to perform transfers using both KNEM (or other 
> kernel-assisted copying mechanism) for messages over a given size and the 
> normal user-space mechanism for smaller messages with the switch-over point 
> being a user-tunable parameter?
>  
> From what I’ve seen, both implementations have something in common, e.g. both 
> use FIFOs to communicate controlling information.
> The motivation behind this is our effort to become greener by extracting
> the best possible out-of-the-box performance on our systems without having to
> profile each and every user application that runs on them. We’ve already 
> determined that activating KNEM really benefits some collective operations on 
> big shared-memory systems, but the increased latency significantly slows down 
> small message transfers, which also hits the pipelined implementations.
>  


Hristo, 

The knem BTL currently available in the trunk does just this :) You can use 
either Knem or Linux CMA to accelerate interprocess transfers. You can use the 
following mca parameters to turn on knem mode: 

-mca btl_sm_use_knem 1

If my memory serves me well, anything under the eager limit is sent by regular
double copy:

-mca btl_sm_eager_limit 4096 (4096 is the default, so anything below one page is
copy-in, copy-out). If I remember correctly, anything below 16k decreased
performance with knem.
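Putting the two parameters together, something along these lines is what I mean
(a sketch; the 16384 cutoff only reflects the ~16k observation above and should be
tuned for your machine):

mpirun -mca btl_sm_use_knem 1 -mca btl_sm_eager_limit 16384 -np 16 ./app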



We also have a collective component leveraging knem capabilities. If you
want more info about the details,
you can look at the following paper we published at IPDPS last year. It covers
what we found to be the best cutoff values for using (or not) knem in several
collectives.

Teng Ma, George Bosilca, Aurelien Bouteiller, Jack Dongarra, "HierKNEM: An 
Adaptive Framework for Kernel-Assisted and Topology-Aware Collective 
Communications on Many-core Clusters," Parallel and Distributed Processing 
Symposium, International, pp. 970-982, 2012 IEEE 26th International Parallel 
and Distributed Processing Symposium, 2012 

http://www.computer.org/csdl/proceedings/ipdps/2012/4675/00/4675a970-abs.html


Enjoy, 
Aurelien 



> sm’s code doesn’t seem to be very complex but still I’ve decided to ask first 
> before diving any deeper.
>  
> Kind regards,
> Hristo
> --
> Hristo Iliev, PhD – High Performance Computing Team
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23, D 52074 Aachen (Germany)
>  
>  
> _______
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










Re: [OMPI devel] [EXTERNAL] Re: RFC: support for Mellanox's "libhcoll" library

2013-06-18 Thread Aurélien Bouteiller
If it is Mellanox specific, maybe the component name could reflect this (like 
mlxhcoll), as it will be visible to end-users. 


Aurelien


Le 18 juin 2013 à 11:25, "Barrett, Brian W" <bwba...@sandia.gov> a écrit :

> In general, I'm ok with it.  I think we should let it soak for a week or
> two in the trunk before we file the CMR to 1.7.
> 
> Brian
> 
> On 6/18/13 6:51 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
> 
>> Sounds good; +1.
>> 
>> On Jun 18, 2013, at 8:02 AM, Joshua Ladd <josh...@mellanox.com> wrote:
>> 
>>> Request for Change:
>>> 
>>> What: Add support for Mellanox Technologies' next-generation
>>> non-blocking collectives, code-named "libhcoll". This comes in the form
>>> of a new "hcoll" component to the "coll" framework.
>>> 
>>> Where: Trunk and 1.7
>>> 
>>> When: July 1
>>> 
>>> Why: In support of MPI 3, Mellanox Technologies will make available its
>>> next generation collectives library, "libhcoll", in MOFED 2.0 releases
>>> and higher starting in the late 2013 timeframe. "Libhcoll" adds support
>>> for truly asynchronous non-blocking collectives on supported HCAs
>>> (Connect X-3 and higher) via Mellanox Technologies' CORE-Direct
>>> technology. "Libhcoll" also adds support for hierarchical collectives
>>> and features a highly scalable infrastructure battle-tested and proven
>>> on some of the world's largest HPC systems.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Joshua S. Ladd, PhD
>>> HPC Algorithms Engineer
>>> Mellanox Technologies
>>> 
>>> Email: josh...@mellanox.com
>>> Cell: +1 (865) 258 - 8898
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
> 
> 
> --
>  Brian W. Barrett
>  Scalable System Software Group
>  Sandia National Laboratories
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










Re: [OMPI devel] June OMPI developer's meeting

2013-05-10 Thread Aurélien Bouteiller
I will be attending. 

Can a local chime in and tell me how practical it is to land in San
Francisco and use public transportation to get to San Jose? Plane schedules
directly to San Jose are not very flexible.

Aurelien 



Le 7 mai 2013 à 15:19, Larry Baker <ba...@usgs.gov> a écrit :

> On 6 May 2013, at 11:14 AM, Jeff Squyres (jsquyres) wrote:
> 
>> We typically do something informally scheduled on the day of, or somesuch 
>> (e.g., around 4pm people start wondering aloud what we should do for dinner 
>> :-) ).  But if there is interest for others to attend, we can probably set 
>> up something ahead of time.
> 
> This option will work best for me.  All I need is an e-mail notice of where 
> and when within 30 minutes or so of the reservation time (depending on the 
> traffic on 101 :) ).
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










Re: [OMPI devel] enabling ft-enable-cr + vprotocol

2012-07-23 Thread Aurélien Bouteiller
Tiago, 

I have never tried to do this; I'm sorry to hear it doesn't work.

I am very busy at the moment, but I'll try to upgrade the pessimist protocol in
the trunk with my latest internal repo, which contains some features to mix
coordinated checkpointing and message logging, as soon as possible.

Aurelien


Le 22 juil. 2012 à 18:47, tiago essex a écrit :

> hi,
> 
> i have been playing around with the code of the pessimist protocol and i have 
> set it so to save some messages and some other specific information into a 
> few files.
> 
> however i also need to be able to perform global checkpoint during execution.
> i was wondering if it's possible to simultaneous set the mca parameters for 
> both the coordinated checkpoint and the vprototocol at the same time, 
> something like this:
> 
> mpirun -n 10 -am ft-enable-cr -mca crs blcr -mca vprotocol pessimist prog
> 
> i have tried, but it seems that the vprotocol does not work with ft-enable-cr 
> enable. is there a way to overcome this? or i'm missing something?
> 
> 
> thank you ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










Re: [OMPI devel] Pessimist Event Logger

2012-01-27 Thread Aurélien Bouteiller
Hugo, 

It seems you want to implement some sort of remote pessimistic logging, à la
MPICH-V1?
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes -- George 
Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, 
Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric 
Magniette, Vincent Néri, Anton Selikhov -- In proceedings of The IEEE/ACM 
SC2002 Conference, Baltimore USA, November 2002

In the PML-V, unlike older designs, the payload of messages and the
non-deterministic events follow different paths. The payload of messages is
logged in the sender's volatile memory, while the non-deterministic events are
sent to a stable event logger before the process is allowed to impact the state
of others (the code you found in the previous email). The best depiction
of this distinction can be read in this paper:
@inproceedings{DBLP:conf/europar/BouteillerHBD11,
  author= {Aurelien Bouteiller and
   Thomas H{\'e}rault and
   George Bosilca and
   Jack J. Dongarra},
  title = {Correlated Set Coordination in Fault Tolerant Message Logging
   Protocols},
  booktitle = {Euro-Par 2011 Parallel Processing - 17th International 
Conference, Proceedings, Part II},
  month = {September},
  year  = {2011},
  pages = {51-64},
  publisher = {Springer},
  series= {Lecture Notes in Computer Science},
  volume= {6853},
  year  = {2011},
  isbn  = {978-3-642-23396-8},
  doi   = {http://dx.doi.org/10.1007/978-3-642-23397-5_6},
}




If you intend to store both the payload and the message log on a remote node, I suggest
you look at the "sender-based" hooks, as this is where the message payload is
managed, and adapt from there. The event loggers can already manage only a subset
of the processes (if you launch as many ELs as processes, you get a 1-1
mapping), but they never handle message payload; you'll have to add all this
yourself if it so pleases you.

Hope it clarifies. 
Aurelien




Le 27 janv. 2012 à 11:19, Hugo Daniel Meyer a écrit :

> Hello Aurélien.
> 
> Thanks for the clarification. Considering what you've mentioned i will have 
> to make some adaptations, because to me, every single message has to be 
> logged. So, a sender not only will be sending messages to the receiver, but 
> also to an event logger. Is there any considerations that i've to take into 
> account when modifying the code?. My initial idea is to use the el_comm with 
> a group of event loggers (because every node uses a different event logger in 
> my approach), and then send the messages to them as you do when using 
> MPI_ANY_SOURCE. 
> 
> Thanks for your help.
> 
> Hugo Meyer
> 
> 2012/1/27 Aurélien Bouteiller <boute...@eecs.utk.edu>
> Hugo,
> 
> Your program does not have non-deterministic events. Therefore, there are no 
> events to log. If you add MPI_ANY_SOURCE, you should see this code being 
> called. Please contact me again if you need more help.
> 
> Aurelien
> 
> 
> Le 27 janv. 2012 à 10:21, Hugo Daniel Meyer a écrit :
> 
> > Hello @ll.
> >
> > George, i'm using some pieces of the pessimist vprotocol. I've observed 
> > that when you do a send, you call vprotocol_receiver_event_flush and here 
> > the macro __VPROTOCOL_RECEIVER_SEND_BUFFER is called. I've noticed that 
> > here you try send a copy of the message to process 0 using the el_comm. 
> > This section of code is never executed, at least in my examples. So, the 
> > message is never sent to the Event Logger, am i correct with this?  I think 
> > that this is happening because the 
> > mca_vprotocol_pessimist.event_buffer_length is always 0.
> >
> > Is there something that i've got to turn on, or i will have to modify this 
> > behavior manually to connect and send messages to the EL?
> >
> > Thanks in advance.
> >
> > Hugo Meyer
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321
> 
> 
> 
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321








Re: [OMPI devel] Pessimist Event Logger

2012-01-27 Thread Aurélien Bouteiller
Hugo, 

Your program does not have non-deterministic events. Therefore, there are no 
events to log. If you add MPI_ANY_SOURCE, you should see this code being 
called. Please contact me again if you need more help.

Aurelien


Le 27 janv. 2012 à 10:21, Hugo Daniel Meyer a écrit :

> Hello @ll.
> 
> George, i'm using some pieces of the pessimist vprotocol. I've observed that 
> when you do a send, you call vprotocol_receiver_event_flush and here the 
> macro __VPROTOCOL_RECEIVER_SEND_BUFFER is called. I've noticed that here you 
> try send a copy of the message to process 0 using the el_comm. This section 
> of code is never executed, at least in my examples. So, the message is never 
> sent to the Event Logger, am i correct with this?  I think that this is 
> happening because the mca_vprotocol_pessimist.event_buffer_length is always 0.
> 
> Is there something that i've got to turn on, or i will have to modify this 
> behavior manually to connect and send messages to the EL?
> 
> Thanks in advance.
> 
> Hugo Meyer
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321








Re: [OMPI devel] [OMPI svn] svn:open-mpi r23931

2010-10-25 Thread Aurélien Bouteiller
Ralph, 

In file included from 
../../../../../trunk/opal/mca/event/libevent207/libevent207_module.c:44:
../../../../../trunk/opal/mca/event/libevent207/libevent/event.h:165:33: error: 
event2/event-config.h: No such file or directory


Looks like you forgot some files. 

Aurelien 


Le 25 oct. 2010 à 10:53, r...@osl.iu.edu a écrit :

> Author: rhc
> Date: 2010-10-25 10:53:33 EDT (Mon, 25 Oct 2010)
> New Revision: 23931
> URL: https://svn.open-mpi.org/trac/ompi/changeset/23931
> 
> Log:
> Remove the sample and test code from the libevent distro - don't need to 
> include them in ompi
> 
> Removed:
>   trunk/opal/mca/event/libevent207/libevent/sample/
>   trunk/opal/mca/event/libevent207/libevent/test/
> Text files modified: 
>   trunk/opal/mca/event/libevent207/libevent/Makefile.am  | 2 +-   
>
>   trunk/opal/mca/event/libevent207/libevent/configure.in | 2 +-   
>
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> Modified: trunk/opal/mca/event/libevent207/libevent/Makefile.am
> ==
> --- trunk/opal/mca/event/libevent207/libevent/Makefile.am (original)
> +++ trunk/opal/mca/event/libevent207/libevent/Makefile.am 2010-10-25 
> 10:53:33 EDT (Mon, 25 Oct 2010)
> @@ -85,7 +85,7 @@
>   libevent.pc.in \
>   Doxyfile \
>   whatsnew-2.0.txt \
> - Makefile.nmake test/Makefile.nmake \
> + Makefile.nmake \
>   $(PLATFORM_DEPENDENT_SRC)
> 
> # OMPI: Changed to noinst and libevent.la
> 
> Modified: trunk/opal/mca/event/libevent207/libevent/configure.in
> ==
> --- trunk/opal/mca/event/libevent207/libevent/configure.in(original)
> +++ trunk/opal/mca/event/libevent207/libevent/configure.in2010-10-25 
> 10:53:33 EDT (Mon, 25 Oct 2010)
> @@ -838,4 +838,4 @@
> fi
> 
> AC_CONFIG_FILES( [libevent.pc libevent_openssl.pc libevent_pthreads.pc] )
> -AC_OUTPUT(Makefile include/Makefile test/Makefile sample/Makefile)
> +AC_OUTPUT(Makefile include/Makefile)
> ___
> svn mailing list
> s...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn




[OMPI devel] orte does not compile on XT5 (pgcc)

2010-09-29 Thread Aurélien Bouteiller
Here is the problem. The PGI compiler is especially paranoid about typedefs of
structures that are declared later. It looks like the include ordering causes
nidmap.h to be included before the orte_jmap_t typedef and its siblings have
been defined.
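The pattern boils down to something like this (hypothetical names, nothing
OMPI-specific; the first line stands for what nidmap.h does, the second for the
typedef header that ends up included too late):

/* minimal reproducer of the ordering problem */
extern my_jmap_t *lookup_jmap(int job);         /* PGI: "Illegal use of symbol, my_jmap_t"      */
typedef struct my_jmap { int job; } my_jmap_t;  /* moving this line above the extern fixes it   */

The real failure on the XT5 looks like this: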

/opt/cray/xt-asyncpe/4.0/bin/cc: INFO: linux target is being used
PGC-S-0040-Illegal use of symbol, orte_jmap_t 
(../../../../../trunk/orte/util/nidmap.h: 47)
PGC-W-0156-Type not specified, 'int' assumed 
(../../../../../trunk/orte/util/nidmap.h: 47)
PGC-S-0040-Illegal use of symbol, orte_pmap_t 
(../../../../../trunk/orte/util/nidmap.h: 48)
PGC-W-0156-Type not specified, 'int' assumed 
(../../../../../trunk/orte/util/nidmap.h: 48)
PGC-S-0040-Illegal use of symbol, orte_nid_t 
(../../../../../trunk/orte/util/nidmap.h: 49)
PGC-W-0156-Type not specified, 'int' assumed 
(../../../../../trunk/orte/util/nidmap.h: 49)
PGC-S-0040-Illegal use of symbol, orte_jmap_t 
(../../../../../trunk/orte/util/nidmap.h: 63)
PGC-W-0156-Type not specified, 'int' assumed 
(../../../../../trunk/orte/util/nidmap.h: 63)
PGC-S-0074-Non-constant expression in initializer 
(../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 95)
PGC-S-0074-Non-constant expression in initializer 
(../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 103)
PGC-W-0093-Type cast required for this conversion of constant 
(../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 109)
PGC-W-0093-Type cast required for this conversion of constant 
(../../../../../trunk/orte/mca/ess/slave/ess_slave_module.c: 109)
PGC/x86-64 Linux 10.5-0: compilation completed with severe errors

Aurelien




Re: [OMPI devel] Autogen.pl, romio and autoconf 2.66

2010-09-28 Thread Aurélien Bouteiller

Le 28 sept. 2010 à 18:10, Aurélien Bouteiller a écrit :

> 
> Le 28 sept. 2010 à 17:55, Jeff Squyres a écrit :
> 
>> On Sep 28, 2010, at 5:30 PM, Aurélien Bouteiller wrote:
>> 
>>> Hi there, 
>>> 
>>> has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when 
>>> configuring romio with the following error: 
>>> === Processing subdir: 
>>> /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio
>>> --- Found configure.in|ac; running autoreconf...
>>> autoreconf: Entering directory `.'
>>> autoreconf: configure.in: not using Gettext
>>> autoreconf: running: aclocal --force 
>>> configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library
>> 
>> This looks like Libtool or Automake isn't installed properly...?
You were right on that one. The system-provided automake on Kraken is broken.
Fixed by installing my own.

>> 
> That's a possibility, but one problem at a time :) 
>> 
>> 
>>> configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments
>>> ../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from...
> Apparently, after some internet searching, it looks like autoconf 2.66 is
> plain broken. I'll try with another version and report back on this issue.
> 
Confirmed: autoconf 2.66 cannot process romio's configure script; 2.68 and 2.65 can, no problem.

> Aurelien 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Autogen.pl, romio and autoconf 2.66

2010-09-28 Thread Aurélien Bouteiller

Le 28 sept. 2010 à 17:55, Jeff Squyres a écrit :

> On Sep 28, 2010, at 5:30 PM, Aurélien Bouteiller wrote:
> 
>> Hi there, 
>> 
>> has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when 
>> configuring romio with the following error: 
>> === Processing subdir: 
>> /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio
>> --- Found configure.in|ac; running autoreconf...
>> autoreconf: Entering directory `.'
>> autoreconf: configure.in: not using Gettext
>> autoreconf: running: aclocal --force 
>> configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library
> 
> This looks like Libtool or Automake isn't installed properly...?
> 
That's a possibility, but one problem at a time :) 
> 
> 
>> configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments
>> ../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from...
Apparently, after some internet searching, it looks like autoconf 2.66 is
plain broken. I'll try with another version and report back on this issue.

Aurelien 




[OMPI devel] Autogen.pl, romio and autoconf 2.66

2010-09-28 Thread Aurélien Bouteiller
Hi there, 

has anybody tried to compile ompi trunk with autoconf 2.66 ? It fails when 
configuring romio with the following error: 
=== Processing subdir: /nics/c/home/bouteill/ompi/trunk/ompi/mca/io/romio/romio
--- Found configure.in|ac; running autoreconf...
autoreconf: Entering directory `.'
autoreconf: configure.in: not using Gettext
autoreconf: running: aclocal --force 
configure.in:2127: warning: macro `AM_PROG_LIBTOOL' not found in library
configure.in:791: error: AC_CHECK_SIZEOF: requires literal arguments
../../lib/autoconf/types.m4:765: AC_CHECK_SIZEOF is expanded from...
configure.in:791: the top level
autom4te: /sw/xt/autoconf/2.66/cnl2.2_gnu4.4.4/bin/m4 failed with exit status: 1
aclocal: /sw/xt/autoconf/2.66/cnl2.2_gnu4.4.4/bin/autom4te failed with exit 
status: 1
autoreconf: aclocal failed with exit status: 1
Command failed: autoreconf -ivf

Should I demote my autoconf to 2.65 ? 

Thanks, 
Aurelien 




Re: [OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-24 Thread Aurélien Bouteiller
The btl is the component responsible for a particular type of fabric. An endpoint is,
roughly, the instantiation of a btl to reach a particular destination on a
particular fabric. A proc is the generic name and properties of a destination.
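If it helps, here is a toy model of how the three fit together (a simplified
illustration only; these are not the real Open MPI structures or names):

/* toy_model.c: cc toy_model.c && ./a.out */
#include <stdio.h>
#include <stddef.h>

/* one btl module per fabric type (tcp, sm, openib, ...) */
struct toy_btl {
    const char *fabric;
    int (*send)(void *endpoint, const void *buf, size_t len);
};

/* generic, fabric-agnostic description of a destination process */
struct toy_proc {
    int             name;          /* jobid/vpid-like identity            */
    struct toy_btl *btl[4];        /* btls able to reach this destination */
    void           *endpoint[4];   /* per-btl handle for this destination */
    int             nbtl;
};

/* a "shared memory" btl whose endpoint is just a mailbox index */
static int sm_send(void *endpoint, const void *buf, size_t len)
{
    (void)buf;
    printf("sm: %zu bytes to mailbox %d\n", len, *(int *)endpoint);
    return 0;
}

int main(void)
{
    struct toy_btl  sm = { "sm", sm_send };
    int             peer_mailbox = 7;   /* the sm endpoint for this peer */
    struct toy_proc peer = { 1, { &sm }, { &peer_mailbox }, 1 };

    /* sending: pick a btl that reaches the proc, hand it the matching endpoint */
    return peer.btl[0]->send(peer.endpoint[0], "hi", 2);
}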

Aurelien

Le 24 févr. 2010 à 09:59, hu yaohui a écrit :

> Could someone tell me the relationship between proc,endpoint and btl?
>  thanks & regards
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Vprotocol pessimist - Open MPI 1.4.1 and 1.4.2a1r22558

2010-02-24 Thread Aurélien Bouteiller
Hi, 

The instructions you found are now obsolete. I'll update them; thank you for
pointing this out.

The new procedure to use uncoordinated checkpointing is now:
mpirun -mca vprotocol pessimist -mca pml ob1,v [regular arguments]
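If you prefer keeping an -am parameter file like the one you had, the equivalent
file contents would be along these lines (a sketch I have not re-tested against
1.4.x; the two lines simply mirror the -mca arguments above):

pml = ob1,v
vprotocol = pessimist

and then: mpirun -am <that file> [regular arguments]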

The version available in the trunk does not support actual restart due to lack of
runtime support, and is limited to performance evaluation of the FT cost without
failures. There is an ongoing proposal to include such support in the main
branch. However, we do have a branched version of Open MPI including all the
necessary support, which can be provided on request. Please also consider that
this is an ongoing research effort that has not yet matured enough to be used
in a production environment.

Aurelien Bouteiller
--
Dr. Aurelien Bouteiller
Innovative Computing Laboratory at the University of Tennessee



Le 6 févr. 2010 à 10:21, Caciano Machado a écrit :
> Hi,
> 
> I'm following the instructions found at
> https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR to run an
> application with the vprotocol pessimist enabled. I believe that I'm
> doing something wrong but I can't figure out the problem.
> 
> I have compiled Open MPI 1.4.1 and 1.4.2a1r22558 with the parameters:
> ./configure --prefix=/usr/local/openmpi-v/ --with-ft=cr
> --with-blcr=/usr/local/blcr/
> 
> Here is my configuration file:
> vprotocol_pessimist_priority=10
> pml_base_verbose=10
> pbl_v_verbose=500
> 
> The command line:
> mpirun -am /etc/v -np 2 -machinefile /etc/machinefile ep.B.8
> 
> And the mpirun output:
> ##3
> [xiru-10:03440] mca: base: components_open: Looking for pml components
> [xiru-10:03440] mca: base: components_open: opening pml components
> [xiru-10:03440] mca: base: components_open: found loaded component cm
> [xiru-10:03440] mca: base: components_open: component cm has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_mtl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
> 
> [xiru-10:03440] mca: base: components_open: component cm open function
> successful
> [xiru-10:03440] mca: base: components_open: found loaded component crcpw
> [xiru-10:03440] mca: base: components_open: component crcpw has no
> register function
> [xiru-10:03440] mca: base: components_open: component crcpw open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component csum
> [xiru-10:03440] mca: base: components_open: component csum has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_btl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
> [xiru-10:03440] mca: base: components_open: component csum open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component ob1
> [xiru-10:03440] mca: base: components_open: component ob1 has no
> register function
> [xiru-10:03440] mca: base: components_open: component ob1 open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component v
> [xiru-10:03440] mca: base: components_open: component v has no register 
> function
> [xiru-10:03440] mca: base: components_open: component v open function 
> successful
> --
> [[65326,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> 
> Module: OpenFabrics (openib)
>  Host: xiru-10.portoalegre.grenoble.grid5000.fr
> 
> Another transport will be used instead, although this may result in
> lower performance.
> --
> [xiru-10:03440] select: initializing pml component cm
> [xiru-10:03440] select: init returned failure for component cm
> [xiru-10:03440] select: component crcpw not in the include list
> [xiru-10:03440] select: component csum not in the include list
> [xiru-10:03440] select: initializing pml component ob1
> [xiru-10:03440] select: init returned priority 20
> [xiru-10:03440] select: component v not in the include list
> [xiru-10:03440] selected ob1 best priority 20
> [xiru-10:03440] select: component ob1 selected
> [xiru-10:03440] mca: base: close: component cm closed
> [xiru-10:03440] mca: base: close: unloading component cm
> [xiru-10:03440] mca: base: close: component crcpw closed
> [xiru-10:03440] mca: base: close: unloading component crcpw
> [xiru-10:03440] mca: base: close: component csum closed
> [xiru-10:03440] mca: base: close: unloading component csum
> [xiru-10:03440] mca: base: close: component v closed
> [xiru-10:03440] mca: base: close: unloading component v
> ...
> 
> #3
> 
> It seems that the vprotocol module is not loading properly. Does
> anyone have a solution to 

[OMPI devel] MCA component dependency

2009-03-25 Thread Aurélien Bouteiller

Hi everyone,

I'm trying to state that a particular component depends on another
that should therefore be dlopened automatically when it is loaded. I
found some code doing exactly that in
mca_base_component_find:open_component, but can't find any example of
how to actually trigger that code path. Does anybody have a clue
what I should do to declare the list of dependencies of my component?


Thanks,
Aurelien



Re: [OMPI devel] [OMPI svn] svn:open-mpi r20196

2009-01-05 Thread Aurélien Bouteiller
Addendum to the previous message concerning this discussion: I think
we should stick with including opal_stdint.h everywhere instead of
inttypes.h (the latter does not always exist on ANSI-pedantic compilers).


Aurelien


Le 4 janv. 09 à 00:09, timat...@osl.iu.edu a écrit :


Author: timattox
Date: 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009)
New Revision: 20196
URL: https://svn.open-mpi.org/trac/ompi/changeset/20196

Log:
Refs #868, #869

The fix for #868, r14358, introduced an (unneeded?) inconsitency...
For Mac OS X systems, inttypes.h will always be included with  
opal_config.h,
and NOT included for non-Mac OS X systems.  For developers using Mac  
OS X,
this masks the need to include inttypes.h or more properly  
opal_stdint.h.


This changeset corrects one of these oopses.  However, the  
underlying problem

still exists.  Moving the equivelent of r14358 into opal_stdint.h from
opal_config_bottom.h might be the "right" solution, but AFAIK, we  
would then
need to replace each direct inclusion of inttypes.h with  
opal_stdint.h to

properly address tickets #868 and #869.

Text files modified:
  trunk/opal/dss/dss_print.c | 1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

Modified: trunk/opal/dss/dss_print.c
==============================================================================

--- trunk/opal/dss/dss_print.c  (original)
+++ trunk/opal/dss/dss_print.c	2009-01-04 00:09:18 EST (Sun, 04 Jan  
2009)

@@ -18,6 +18,7 @@

#include "opal_config.h"

+#include "opal_stdint.h"
#include 

#include "opal/dss/dss_internal.h"
___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn





Re: [OMPI devel] [OMPI svn] svn:open-mpi r20196

2009-01-05 Thread Aurélien Bouteiller

Tim,

To answer your question in ticket #869: the only known missing
feature in opal_stdint.h is that there is no portable way to
printf a size_t. Its type is subject to so many changes depending on
the platform and compiler that it is impossible to be sure that
PRI_size_t is not going to dump a lot of warnings. Aside from that, it
should be pretty solid.


Aurelien



Le 4 janv. 09 à 00:09, timat...@osl.iu.edu a écrit :


Author: timattox
Date: 2009-01-04 00:09:18 EST (Sun, 04 Jan 2009)
New Revision: 20196
URL: https://svn.open-mpi.org/trac/ompi/changeset/20196

Log:
Refs #868, #869

The fix for #868, r14358, introduced an (unneeded?) inconsitency...
For Mac OS X systems, inttypes.h will always be included with  
opal_config.h,
and NOT included for non-Mac OS X systems.  For developers using Mac  
OS X,
this masks the need to include inttypes.h or more properly  
opal_stdint.h.


This changeset corrects one of these oopses.  However, the  
underlying problem

still exists.  Moving the equivelent of r14358 into opal_stdint.h from
opal_config_bottom.h might be the "right" solution, but AFAIK, we  
would then
need to replace each direct inclusion of inttypes.h with  
opal_stdint.h to

properly address tickets #868 and #869.

Text files modified:
  trunk/opal/dss/dss_print.c | 1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

Modified: trunk/opal/dss/dss_print.c
==============================================================================

--- trunk/opal/dss/dss_print.c  (original)
+++ trunk/opal/dss/dss_print.c	2009-01-04 00:09:18 EST (Sun, 04 Jan  
2009)

@@ -18,6 +18,7 @@

#include "opal_config.h"

+#include "opal_stdint.h"
#include 

#include "opal/dss/dss_internal.h"
___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn





Re: [OMPI devel] Should visibility and memchecker abort configure?

2008-10-03 Thread Aurélien Bouteiller

Hi Ralph,

1. No. Having visibility turned off without knowing it is the best way
for us to commit bugs in the trunk without noticing, I mean before
somebody else gets their leg caught in the "not-compiling-trunk trap". I
have had more than my share of responsibility for that kind of problem in
the past, rooted exactly in visibility issues. I must say that it
is painful enough that some compilers will just silently ignore
visibility settings, without adding configure to the chain of tools
that just do whatever they want regardless of the requested flags. If I
can't have visibility, I want to know. Especially in debug mode.


2. If Valgrind is not available and this feature requires valgrind, it
is reasonable to disable it. Anyway, this would not lead to silent bugs
being included in the trunk if it gets disabled "silently". (Are you sure,
though? I used to enable this on my Mac, where there is of course no
valid valgrind installed, and it compiled just fine.)


Aurelien

Le 2 oct. 08 à 18:04, Ralph Castain a écrit :


Hi folks

I make heavy use of platform files to provide OMPI support for the  
three NNSA labs. This means supporting multiple compilers, several  
different hardware and software configs, debug vs optimized, etc.


Recently, I have encountered a problem that is making life  
difficult. The problem revolves around two configure options that  
apply to debug builds:


1. --enable-visibility. Frustrating as it may be, some compilers  
just don't support visibility - and others only support it for  
versions above a specific level. Currently, this option will abort  
the configure procedure if the compiler does not support visibility.


2. --enable-memchecker. This framework has a component that requires  
valgrind 3.2 or above. Unfortunately, if a valgrind meeting that  
criteria is not found, this option will also abort the configure  
procedure.


Is it truly -necessary- for these options to abort configure in  
these conditions? Would it be acceptable for:


* visibility just to print a big warning, surrounded by asterisks,  
that the selected compiler does not support visibility - but allow  
the build to continue?


* memchecker to also print a big warning, surrounded by asterisks,  
explaining the valgrind requirement and turn "off" the build of the  
memchecker/valgrind component - but allow the build to continue? It  
would seem to me that we would certainly want this for the future  
anyway as additional memchecker components are supported.


If this would be acceptable, I am happy to help with or implement  
the changes. It would be greatly appreciated.


Thanks
Ralph

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] [OMPI svn] svn:open-mpi r19653

2008-09-29 Thread Aurélien Bouteiller
 we need to extend the existing RML function to handle the  
subsequent setting of the route to the proc itself. In the current  
dpm, we automatically assume that the route will be to a different  
job family, and hence send the routing info to the HNP. However,  
this may not be true - e.g., after a comm_spawn, there is no reason  
to route through the HNP since the job family is the same.


This is not correct. The current code in the DPM already takes care of
the "usual" case where both ends are in the same job family; in that
case it creates a "direct" route to the remote end (maybe it should
just do nothing, though). This logic is pretty simple and is well
contained in the DPM. Moving this logic to the rml should not
change much: the complexity will just move from the dpm to
the routed. The existing single piece of dpm code already does everything we need
for current and future use, while we might have to upgrade all the
routed components to take this special case into account. This is why I would
advocate for the lesser effort, given the exact same functionality in the
end.


Haven't thought it all through yet, but wanted to suggest we think  
about it as we may (per the FT July discussions) need to define  
routes for things other than just DPM-related operations. Perhaps we  
should do some design discussion off-list to see what makes sense?


I'm always open to discussion. Let me know if you find this useful for
some purpose.



Thanks
Ralph


Aurelien




On Sep 28, 2008, at 8:33 AM, Aurélien Bouteiller wrote:


Ralph,

I just split the existing static function from inside the dpm and
exposed it to the outside world. The idea is that the dpm creates
the (opaque) port strings and therefore knows how they are supposed
to be formatted, so it is responsible for parsing them. Second, I
split the parsing and routing into two different functions because
sometimes you might want to parse without creating a route to the
target.


I'll check the RML function to see if it offers similar
functionality on Monday. I have no strong religious belief on
whether this should be an rml or dpm function. So I don't care as
long as I have what I need :]


Thanks for the feedback,
Aurelien


Le 27 sept. 08 à 20:53, Ralph Castain a écrit :


Yo Aurelien

Regarding the dpm including a "route_to_port" API: this actually is
pretty close to being an exact duplicate of an already existing
function in the RML that takes a URI as its input, parses it to
separate the proc name and the contact info, sets the contact info
into the OOB, sets the route to that proc, and returns the proc name
to the caller. Take a look at orte/mca/rml/base/rml_base_contact.c.


All we need to do is add the logic to that function so that, if  
the target proc is not in our job family, we update the route and  
contact info in the HNP instead of locally.


This would keep all the "setting_route_to_proc" functionality in  
one place, instead of duplicating it in the dpm, thus making  
maintenance much easier.


Make sense?
Ralph


On Sep 27, 2008, at 7:22 AM, boute...@osl.iu.edu wrote:


Author: bouteill
Date: 2008-09-27 09:22:32 EDT (Sat, 27 Sep 2008)
New Revision: 19653
URL: https://svn.open-mpi.org/trac/ompi/changeset/19653

Log:
Add functions to access the opaque port_string and to add routes to a
remote port. This is useful for FT, but could also turn out useful when
considering MPI3 extensions to the MPI2 dynamics.






Text files modified:
trunk/ompi/mca/dpm/base/base.h  | 3 +
trunk/ompi/mca/dpm/base/dpm_base_null_fns.c |12 
trunk/ompi/mca/dpm/base/dpm_base_open.c | 2
trunk/ompi/mca/dpm/dpm.h|20 +++
trunk/ompi/mca/dpm/orte/dpm_orte.c  |   114 +++++--

5 files changed, 99 insertions(+), 52 deletions(-)

Modified: trunk/ompi/mca/dpm/base/base.h
==============================================================================

--- trunk/ompi/mca/dpm/base/base.h  (original)
+++ trunk/ompi/mca/dpm/base/base.h	2008-09-27 09:22:32 EDT (Sat, 27 Sep 2008)

@@ -92,6 +92,9 @@
int ompi_dpm_base_null_dyn_finalize (void);
void ompi_dpm_base_null_mark_dyncomm (ompi_communicator_t *comm);
int ompi_dpm_base_null_open_port(char *port_name, orte_rml_tag_t  
given_tag);

+int ompi_dpm_base_null_parse_port(char *port_name,
+                                  orte_process_name_t *rproc, orte_rml_tag_t *tag);
+int ompi_dpm_base_null_route_to_port(char *rml_uri, orte_process_name_t *rproc);

int ompi_dpm_base_null_close_port(char *port_name);

/* useful globals */

Modified: trunk/ompi/mca/dpm/base/dpm_base_null_fns.c
==============================================================================

--- trunk/ompi/mca/dpm/base/dpm_base_null_fns.c (original)
+++ trunk/ompi/mca/dpm/base/dpm_base_null_fns.c	2008-09-27 09:22:32 EDT (Sat, 2

Re: [OMPI devel] trunk temporarily closed

2008-09-25 Thread Aurélien Bouteiller

Any idea of a timeframe for the problem to get fixed ?

Aurelien

Le 25 sept. 08 à 14:03, Jeff Squyres a écrit :


On Sep 25, 2008, at 1:44 PM, Jeff Squyres (jsquyres) wrote:


The SVN trunk has been temporarily closed due to what may be an
accidental commit.




The entire OMPI SVN is now offline (vs. just the trunk).

--
Jeff Squyres
Cisco Systems






Re: [OMPI devel] gdb libmpi.dylib on Leopard

2008-09-19 Thread Aurélien Bouteiller
I filed the following bug report on the Apple Developer Connection. As
a short summary, I suggest they get in touch with us and include the
--whole-archive mechanism in their ld.


Aurelien

19-Sep-2008 03:08 PM Aurelien Bouteiller:
Summary:
Because the Apple ld does not include GNU ld's --whole-archive/--no-whole-archive
mechanism to allow loading of all members of selected archives, libtool
(including GNU libtool) is forced to unpack all the members of a convenience
library (and later delete them), and afterwards needs to run dsymutil.
Unfortunately, because the archives are unpacked to a temporary space before
being included in the final library, dsymutil seems to get confused. As a
consequence, it is impossible to debug a library with gdb, the .o files never
being found, even if the library actually contains all the necessary debug
symbols.


Steps to reproduce:
1. Download a svn Open MPI trunk release (or any libtool based  
project, I've experienced the same problems when compiling my own  
gcc4.3). Please note that you need autoconf 2.62 and automake 1.10 to  
compile Open MPI trunk.

2. configure Open MPI with the debug options (configure --enable-debug)
3. make install
4. find or create a sample MPI program (a minimal example is given after step 6), mpicc it.
5. mpirun -np 1 gdb mpi_sample_program
6. break MPI_Init, r, n.
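
For step 4, any program that calls MPI_Init will do; a minimal example
(mine, not part of the original report) could be:

/* Minimal MPI program for reproducing the gdb symbol problem; the
 * interesting part is simply that it calls MPI_Init, where the
 * breakpoint from step 6 is set. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}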

Expected results:
6: you should step each line of the MPI_Init function

Actual results:
6. you see a large number of warnings
warning: Could not find object file "/Users/bouteill/ompi/debug.build/ 
opal/.libs/libopen-pal.lax/libmca_memchecker.a/memchecker_base_open.o"  
- no debug information available for "../../../../trunk/opal/mca/ 
memchecker/base/memchecker_base_open.c".


You are unable to step in MPI_Init. Instead the execution continues up  
to reach the "main" function.


Regression:
Used to work with Tiger.

Notes:
If you need some more details or want to cooperate with us, please
register to the Open MPI devel mailing list. As a major open source
project we have been working on a fix for this issue for a while, but
were unable to correct it without modifications to Apple's ld.


We believe that the best workaround would be to include the
--whole-archive/--no-whole-archive mechanism. Then there would no longer
be any need to unpack the convenience archives before building the
.dylib, and as a friendly side effect compilation time should improve a lot.


Thanks,
--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321
(on behalf of the Open MPI development community)



Le 19 sept. 08 à 17:22, Jeff Squyres a écrit :


Thanks for following up!

Aurelien, I'll leave this to you -- I rarely do OMPI development on  
my Mac...



On Sep 19, 2008, at 5:08 PM, Ralf Wildenhues wrote:


Hello,

I asked Peter O'Gorman about this issue, and he said

| I believe that running dsymutil on the generated lib would then create a
| libfoo.dSYM in the .libs directory containing all the necessary
| debugging information, which could be used for debugging the library in
| the build tree (gdb should find it sitting there next to the original
| library and use the debug information in the .dSYM). Libtool-2.2.6 does
| run dsymutil and create the .dSYM though...
|
| There should be a libmpi.dylib in a .libs directory and a
| libmpi.dylib.dSYM directory next to it.

Also, he said that it could help if you reported a bug at
<http://bugreporter.apple.com>, under the notion that the
more people file bugs with them, the more they will understand
what problems users have with the dsymutils issues.

Cheers,
Ralf

* Aurélien Bouteiller wrote on Fri, Sep 19, 2008 at 09:44:46PM CEST:

Ok,

I didn't forget to rerun autogen.sh (I even erased the libltdl and the
various libtool wrappers that are generated at autogen/configure time). I
checked the link Ralf submitted to our attention. This is exactly the
same problem, or at least the same symptoms. The latest version of libtool
runs dsymutil on the created .so/.dylib, but the bad thing is that
dsymutil returns similar warning messages about missing ".lax" files.
Therefore, even running it manually on the .dsym does not help.

I upgraded (compiled my own copy of) my gcc to 4.3.2 (you should do it too,
Jeff, the experimental one has been giving me headaches in the past). Now I
also get the same warning messages for the internal libs of gcc as for
Open MPI. This leads me to believe this is not an Open MPI bug, but more
probably a libtool/ld issue.

I'll switch to Linux for my devel for now, but if you have any success
story...

Aurelien

Le 19 sept. 08 à 15:20, Jeff Squyres a écrit :


I get the same problem on my MBP with 10.5.5.  However, I'm running
the gcc from hpc.sf.net:

-
[15:16] rtp-jsquyres-8713:~/mpi % gcc --version
gcc (GCC) 4.3.0 20071026 (experiment

Re: [OMPI devel] gdb libmpi.dylib on Leopard

2008-09-19 Thread Aurélien Bouteiller

Ok,

I didn't forget to rerun autogen.sh (I even erased the libltdl and the
various libtool wrappers that are generated at autogen/configure
time). I checked the link Ralf submitted to our attention. This is
exactly the same problem, or at least the same symptoms. The latest
version of libtool runs dsymutil on the created .so/.dylib, but the
bad thing is that dsymutil returns similar warning messages about
missing ".lax" files. Therefore, even running it manually on the .dsym
does not help.


I upgraded (compiled my own copy of) my gcc to 4.3.2 (you should do it
too, Jeff, the experimental one has been giving me headaches in the
past). Now I also get the same warning messages for the internal libs of
gcc as for Open MPI. This leads me to believe this is not an Open
MPI bug, but more probably a libtool/ld issue.


I'll switch to Linux for my devel for now, but if you have any success
story...


Aurelien

Le 19 sept. 08 à 15:20, Jeff Squyres a écrit :

I get the same problem on my MBP with 10.5.5.  However, I'm running  
the gcc from hpc.sf.net:


-
[15:16] rtp-jsquyres-8713:~/mpi % gcc --version
gcc (GCC) 4.3.0 20071026 (experimental)
...
-

Not the /usr/bin/gcc that ships with Leopard.  I don't know if that  
matters or not.


I'm using AC 2.63, AM 1.10.1, LT 2.2.6a with a fairly vanilla build  
of Open MPI:


./configure --prefix=/Users/jsquyres/bogus --disable-mpi-f77 -- 
enable-mpirun-prefix-by-default


Here's what happens -- I fire up an MPI program and it deadlocks.  I  
attach to an MPI process PID with gdb (I am using /usr/bin/gdb --  
the Leopard-shipped gdb).  I get oodles of messages like Aurelien's:


-
warning: Could not find object file "/data/jsquyres/svn/ompi/ 
ompi/.libs/libmpi.lax/libdatatype.a/convertor.o" - no debug  
information available for "convertor.c".
warning: Could not find object file "/data/jsquyres/svn/ompi/ 
ompi/.libs/libmpi.lax/libdatatype.a/copy_functions.o" - no debug  
information available for "copy_functions.c".
warning: Could not find object file "/data/jsquyres/svn/ompi/ 
ompi/.libs/libmpi.lax/libdatatype.a/copy_functions_heterogeneous.o"  
- no debug information available for "copy_functions_heterogeneous.c".

-----


On Sep 19, 2008, at 2:31 PM, Ralf Wildenhues wrote:


* Aurélien Bouteiller wrote on Fri, Sep 19, 2008 at 08:02:40PM CEST:
Thanks Ralf for the support. I upgraded to libtool 2.2.6 and it didn't
solve the problem, though. Still looking for somebody to confirm
whether it's working or not on their Mac.


Did you rerun autogen.sh?  All I know is that your report looks  
really

similar to <http://gcc.gnu.org/ml/gcc/2008-08/msg00054.html> and that
one is apparently solved with Libtool 2.2.6.

If yours is still broken, then some more details would be nice.

Cheers,
Ralf



--
Jeff Squyres
Cisco Systems







Re: [OMPI devel] gdb libmpi.dylib on Leopard

2008-09-19 Thread Aurélien Bouteiller
Thanks Ralf for the support. I upgraded to libtool 2.2.6 and it didn't
solve the problem, though. Still looking for somebody to confirm whether
it's working or not on their Mac.


Aurelien

Le 17 sept. 08 à 12:39, Ralf Wildenhues a écrit :


Hello Aurélien,

* Aurélien Bouteiller wrote on Wed, Sep 17, 2008 at 06:32:11PM CEST:
I have been facing a weird problem for several months now (I guess since I
upgraded from Tiger to Leopard). I am unable to debug Open MPI using gdb
on my Mac. The problem comes from gdb not being able to load symbols from
the dynamic libraries of Open MPI. I receive a message "warning: Could
not find object file "/Users/bouteill/ompi/debug.build/
opal/.libs/libopen-pal.lax/libmca_memory.a/memory_base_close.o" - no
debug information available for "../../../../trunk/opal/mca/memory/
base/memory_base_close.c".". As you can see, the path to the object file
containing the symbols is not correct. It points to the temporary files
expanded during the final link stage. As those files do not exist
anymore, gdb gets confused.


I have a vague memory that this is fixed in Libtool 2.2.6.  If you're
using an older version, please retry bootstrapping OpenMPI with that
one.

Cheers,
Ralf





[OMPI devel] gdb libmpi.dylib on Leopard

2008-09-17 Thread Aurélien Bouteiller
I have been facing a weird problem for several months now (I guess
since I upgraded from Tiger to Leopard). I am unable to debug Open MPI
using gdb on my Mac. The problem comes from gdb not being able to load
symbols from the dynamic libraries of Open MPI. I receive a message
"warning: Could not find object file "/Users/bouteill/ompi/debug.build/
opal/.libs/libopen-pal.lax/libmca_memory.a/memory_base_close.o" - no
debug information available for "../../../../trunk/opal/mca/memory/
base/memory_base_close.c".". As you can see, the path to the object
file containing the symbols is not correct. It points to the temporary
files expanded during the final link stage. As those files do not
exist anymore, gdb gets confused.


Supposedly, the rpath option of libtool should take care of this and
correct the path to the symbols. Is anybody successful at debugging
Open MPI on Leopard? Is this a bug in Open MPI or a bug in libtool/gdb?
Any known fix?


Aurelien

--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321







Re: [OMPI devel] PLM consistency: priority

2008-07-11 Thread Aurélien Bouteiller
We don't want the user to have to select the best PML by hand. The
logic inside the current selection process selects the best pml for
the underlying network. However, changing the priority is pretty
meaningless from the user's point of view. So while retaining the
selection process, including priorities, we might want to remove the
priority parameter and expose only the pml=ob1,cm syntax to the user.


Aurelien

Le 11 juil. 08 à 10:56, Ralph H Castain a écrit :

Okay, another fun one. Some of the PLM modules use MCA params to  
adjust
their relative selection priority. This can lead to very unexpected  
behavior
as which module gets selected will depend on the priorities of the  
other

selectable modules - which changes from release to release as people
independently make adjustments and/or new modules are added.

Fortunately, this doesn't bite us too often since many environments  
only
support one module, and since there is nothing to tell the user that  
the plm

module whose priority they raised actually -didn't- get used!

However, in the interest of "least astonishment", some of us working  
on the
RTE had changed our coding approach to avoid this confusion. Given  
that we
have this nice mca component select logic that takes the specified  
module -
i.e., "-mca plm foo" always yields foo if it can run, or errors out  
if it
can't - then the safest course is to remove MCA params that adjust  
module
priorities and have the user simply tell us which module they want  
us to

use.

Do we want to make this consistent, at least in the PLM? Or do you  
want to

leave the user guessing? :-)

Ralph







Re: [OMPI devel] [OMPI svn] svn:open-mpi r18804

2008-07-03 Thread Aurélien Bouteiller

Thanks Ralph, this fix does the trick.

Aurelien

Le 3 juil. 08 à 13:53, r...@osl.iu.edu a écrit :


Author: rhc
Date: 2008-07-03 13:53:37 EDT (Thu, 03 Jul 2008)
New Revision: 18804
URL: https://svn.open-mpi.org/trac/ompi/changeset/18804

Log:
Repair the MPI-2 dynamic operations. This includes:

1. repair of the linear and direct routed modules

2. repair of the ompi/pubsub/orte module to correctly init routes to  
the ompi-server, and correctly handle failure to correctly parse the  
provided ompi-server URI


3. modification of orterun to accept both "file" and "FILE" for  
designating where the ompi-server URI is to be found - purely a  
convenience feature


4. resolution of a message ordering problem during the connect/ 
accept handshake that allowed the "send-first" proc to attempt to  
send to the "recv-first" proc before the HNP had actually updated  
its routes.


Let this be a further reminder to all - message ordering is NOT  
guaranteed in the OOB


5. Repair the ompi/dpm/orte module to correctly init routes during  
connect/accept.


Reminder to all: messages sent to procs in another job family (i.e.,  
started by a different mpirun) are ALWAYS routed through the  
respective HNPs. As per the comments in orte/routed, this is  
REQUIRED to maintain connect/accept (where only the root proc on  
each side is capable of init'ing the routes), allow communication  
between mpirun's using different routing modules, and to minimize  
connections on tools such as ompi-server. It is all taken care of  
"under the covers" by the OOB to ensure that a route back to the  
sender is maintained, even when the different mpirun's are using  
different routed modules.


6. corrections in the orte/odls to ensure proper identification of  
daemons participating in a dynamic launch


7. corrections in build/nidmap to support update of an existing  
nidmap during dynamic launch


8. corrected implementation of the update_arch function in the ESS,  
along with consolidation of a number of ESS operations into base  
functions for easier maintenance. The ability to support info from  
multiple jobs was added, although we don't currently do so - this  
will come later to support further fault recovery strategies


9. minor updates to several functions to remove unnecessary and/or  
no longer used variables and envar's, add some debugging output, etc.


10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to  
true if the provided proc is a daemon


There is still more cleanup to be done for efficiency, but this at  
least works.


Tested on single-node Mac, multi-node SLURM via odin. Tests included  
connect/accept, publish/lookup/unpublish, comm_spawn,  
comm_spawn_multiple, and singleton comm_spawn.


Fixes ticket #1256



Added:
  trunk/orte/mca/ess/base/ess_base_nidmap.c
Removed:
  trunk/orte/mca/ess/base/ess_base_build_nidmap.c
Text files modified:
  trunk/ompi/attribute/attribute_predefined.c |13
  trunk/ompi/mca/dpm/base/base.h  | 1
  trunk/ompi/mca/dpm/base/dpm_base_null_fns.c | 5
  trunk/ompi/mca/dpm/base/dpm_base_open.c | 1
  trunk/ompi/mca/dpm/dpm.h| 7
  trunk/ompi/mca/dpm/orte/dpm_orte.c  |   494 ++++++-
  trunk/ompi/mca/pubsub/orte/pubsub_orte.c|14
  trunk/ompi/proc/proc.c  | 1
  trunk/orte/mca/ess/alps/ess_alps_module.c   |   163 +
  trunk/orte/mca/ess/base/Makefile.am | 2
  trunk/orte/mca/ess/base/base.h  |12
  trunk/orte/mca/ess/base/ess_base_get.c  | 9
  trunk/orte/mca/ess/base/ess_base_put.c  | 8
  trunk/orte/mca/ess/env/ess_env_module.c |   144 +--
  trunk/orte/mca/ess/hnp/ess_hnp_module.c | 2
  trunk/orte/mca/ess/lsf/ess_lsf_module.c |   138 +-
  trunk/orte/mca/ess/singleton/ess_singleton_module.c |   182 +++--
  trunk/orte/mca/ess/slurm/ess_slurm_module.c |   136 +-
  trunk/orte/mca/ess/tool/ess_tool_module.c   | 2
  trunk/orte/mca/grpcomm/bad/grpcomm_bad_module.c |22 +
  trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c|13
  trunk/orte/mca/odls/base/odls_base_default_fns.c|52 ++--
  trunk/orte/mca/odls/base/odls_base_open.c   | 8
  trunk/orte/mca/odls/base/odls_private.h | 4
  trunk/orte/mca/rml/base/rml_base_receive.c  |21 +
  trunk/orte/mca/rml/rml_types.h  | 2
  trunk/orte/mca/routed/binomial/routed_binomial.c|   192 +++++--
  trunk/orte/mca/routed/direct/routed_direct.c|   316 ++++++--
  trunk/orte/mca/routed/linear/routed_linear.c|   198 +++++--

  trunk/orte/runtime/orte_globals.h   |15 +
  

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Aurélien Bouteiller
The first approach sounds fair enough to me. We should avoid 2 and 3  
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in  
the BTL selection process. When using the complete PML selection, BTL  
would be initialized several times, leading to a variety of bugs.  
Eventually the PML selection should return to its old self, when the  
BTL bug gets fixed.


Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research into the modex and came across  
something I
don't fully understand. It seems we have each process insert into  
the modex
the name of the PML module that it selected. Once the modex has  
exchanged

that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose  
different PML
modules and hence create an "abort" scenario. However, if I look  
inside the
PML's at their selection logic, I find that a proc can ONLY pick a  
module

other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,  
since the
mca param is propagated, ALL procs have no choice but to pick that  
same
module, so that can't cause us to abort (we will have already  
returned an

error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and  
that it is
other than "psm". In this case, the CM module will be selected  
because its

default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me  
that you
either have the required capability or you don't. I can see that in  
some
environments (e.g., rsh across unmanaged collections of machines),  
it might
be possible for someone to launch across a set of machines where  
some do and
some don't have the required support. However, in all other cases,  
this will

be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should  
feel free
to confirm or correct it), it seems to me that this could be  
streamlined via

one or more means:

1. at the most, we could have rank=0 add the PML module name to the
modex, and other procs simply check it against their own and return an
error if they differ. This accomplishes the identical functionality to
what we have today, but with much less info in the modex. (A rough
sketch of such a check appears after this list.)

2. we could eliminate this info from the modex altogether by  
requiring the
user to specify the PML module if they want something other than the  
default
OB1. In this case, there can be no confusion over what each proc is  
to use.
The CM module will attempt to init the MTL - if it cannot do so,  
then the
job will return the correct error and tell the user that CM/MTL  
support is

unavailable.

3. we could again eliminate the info by not inserting it into the  
modex if
(a) the default PML module is selected, or (b) the user specified  
the PML
module to be used. In the first case, each proc can simply check to  
see if
they picked the default - if not, then we can insert the info to  
indicate
the difference. Thus, in the "standard" case, no info will be  
inserted.


In the second case, we will already get an error if the specified  
PML module
could not be used. Hence, the modex check provides no additional  
info or

value.
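
To make option 1 concrete, here is a rough, purely illustrative sketch of
the "rank 0 announces its selection, everyone else compares" idea. It
deliberately uses plain MPI calls instead of the internal modex/OPAL APIs
(the real check runs before MPI is fully initialized), so treat it as an
analogy rather than as the actual implementation:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char my_pml[64] = "ob1";   /* stand-in for the locally selected PML name */
    char root_pml[64];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 announces its selection; everyone receives it */
    strncpy(root_pml, my_pml, sizeof(root_pml) - 1);
    root_pml[sizeof(root_pml) - 1] = '\0';
    MPI_Bcast(root_pml, (int)sizeof(root_pml), MPI_CHAR, 0, MPI_COMM_WORLD);

    /* every other rank compares against its own selection */
    if (0 != rank && 0 != strcmp(root_pml, my_pml)) {
        fprintf(stderr, "rank %d selected PML %s but rank 0 selected %s\n",
                rank, my_pml, root_pml);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}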

I understand the motivation to support automation. However, in this  
case,

the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be  
in order?


Ralph








Re: [OMPI devel] "__printf__" attribute

2008-05-08 Thread Aurélien Bouteiller
They refer to the parameters of the function. In the linked example, 2
means that fmt is the second argument of the function, and 3 is the
position of the first variadic argument that is checked against the fmt
string.

Aurelien

Le 8 mai 08 à 18:24, Jeff Squyres a écrit :


Rainer --

What do the numeric arguments refer to in the attribute format stuff?
The wiki page has only one example, and it doesn't explain what these
numbers are:

https://svn.open-mpi.org/trac/ompi/wiki/CompilerAttributes

Thanks!

--
Jeff Squyres
Cisco Systems






Re: [OMPI devel] [OMPI svn] svn:open-mpi r18303

2008-04-25 Thread Aurélien Bouteiller
To bounce on George's last remark: currently, when a job dies without
unpublishing a port with Unpublish (due to poor user programming, a
failure, or an abort), ompi-server keeps the reference forever and a new
application therefore cannot publish under the same name again. So I
guess this is a good point to correctly clean up all published/opened
ports when the application ends (for whatever reason).


Another cool feature could be to have mpirun behave as an ompi-server,
and publish a suitable URI if requested to do so (if the urifile does
not exist yet?). I know from the source code that mpirun already
includes everything needed to offer this feature, except the ability to
provide a suitable URI.


  Aurelien

Le 25 avr. 08 à 19:19, George Bosilca a écrit :


Ralph,

Thanks for your concern regarding the level of compliance of our  
implementation of the MPI standard. I don't know who were the MPI  
gurus you talked with about this issue, but I can tell that for once  
the MPI standard is pretty clear about this.


As stated by Aurelien in his last email, the use of the plural in
several sentences strongly suggests that the status of a port should not
be implicitly modified by MPI_Comm_accept or MPI_Comm_connect.
Moreover, at the beginning of the chapter in the MPI standard, it is
specified that connect/accept work exactly as in TCP. In other words,
once the port is opened it stays open until the user explicitly closes
it.


However, not all corner cases are addressed by the MPI standard.
What happens on MPI_Finalize ... it's a good question. Personally, I
think we should stick with the TCP similarities. The port should be
not only closed but also unpublished. This will solve all issues with
people trying to look up a port once the originator is gone.


 george.

On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:

As I said, it makes no difference to me. I just want to ensure that
everyone agrees on the interpretation of the MPI standard. We have had
these discussions in the past, with differing views. My guess here is
that the port was left open mostly because the person who wrote the
C-binding forgot to close it. ;-)

So, you MPI folks: do we allow multiple connections against a  
single port,
and leave the port open until explicitly closed? If so, then do we  
generate
an error if someone calls MPI_Finalize without first closing the  
port? Or do

we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is  
completed?


Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>  
wrote:


Actually, the port was still left open forever before the change.  
The

bug damaged the port string, and it was not usable anymore, not only
in subsequent Comm_accept, but also in Close_port or Unpublish_name.

To more specifically answer your open port concern: if the user does
not want to have an open port anymore, he should specifically call
MPI_Close_port and not rely on MPI_Comm_accept to close it. Actually
the standard suggests the exact contrary: section 5.4.2 states "it must
call MPI_Open_port to establish a port [...] it must call
MPI_Comm_accept to accept connections from clients". Because there are
multiple clients AND multiple connections in that sentence, I assume
the port can be used in multiple accepts.

Aurelien

Le 25 avr. 08 à 16:53, Ralph Castain a écrit :

Hmmm...just to clarify, this wasn't a "bug". It was my  
understanding

per the
MPI folks that a separate, unique port had to be created for every
invocation of Comm_accept. They didn't want a port hanging around
open, and
their plan was to close the port immediately after the connection  
was

established.

So dpm_orte was written to that specification. When I reorganized
the code,
I left the logic as it had been written - which was actually done  
by

the MPI
side of the house, not me.

I have no problem with making the change. However, since the
specification was created on the MPI side, I just want to make sure
that the MPI folks all realize this has now been changed. Obviously,
if this change in spec is adopted, someone needs to make sure that the
C and Fortran bindings -do not- close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu" <boute...@osl.iu.edu>  
wrote:



Author: bouteill
Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
New Revision: 18303
URL: https://svn.open-mpi.org/trac/ompi/changeset/18303

Log:
Fix a bug that prevented using the same port (as returned by Open_port)
for several Comm_accepts.


Text files modified:
trunk/ompi/mca/dpm/orte/dpm_orte.c |19 ++-
1 files changed, 10 insertions(+), 9 deletions(-)

Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
==============================================================================

--- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
+++ tr

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-08 Thread Aurélien Bouteiller

Still no luck here,

I launch those three processes :
term1$ ompi-server -d --report-uri URIFILE

term2$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1  
simple_accept


term3$ mpirun -mca routed unity -ompi-server file:URIFILE -np 1  
simple_connect


The output of ompi-server shows a successful publish and lookup. I get  
the correct port on the client side. However, the result is the same  
as when not using the Publish/Lookup mechanism: the connect fails  
saying the

port cannot be reached.

Found port < 1940389889.0;tcp:// 
160.36.252.99:49777;tcp6://2002:a024:ed65:9:21b:63ff:fecb: 
28:49778;tcp6://fec0::9:21b:63ff:fecb:28:49778;tcp6://2002:a024:ff7f: 
9:21b:63ff:fecb:28:49778:300 >
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message  
is attempting to be sent to a process whose contact information is  
unknown in file ../../../../../trunk/orte/mca/rml/oob/rml_oob_send.c  
at line 140
[abouteil.nomad.utk.edu:60339] [[29620,1],0] attempted to send to  
[[29608,1],0]
[abouteil.nomad.utk.edu:60339] [[29620,1],0] ORTE_ERROR_LOG: A message  
is attempting to be sent to a process whose contact information is  
unknown in file ../../../../../trunk/ompi/mca/dpm/orte/dpm_orte.c at  
line 455

[abouteil.nomad.utk.edu:60339] *** An error occurred in MPI_Comm_connect
[abouteil.nomad.utk.edu:60339] *** on communicator MPI_COMM_SELF
[abouteil.nomad.utk.edu:60339] *** MPI_ERR_UNKNOWN: unknown error
[abouteil.nomad.utk.edu:60339] *** MPI_ERRORS_ARE_FATAL (goodbye)

I took a look at the source code, and I think the problem comes from a
conceptual mistake in MPI_Comm_connect. The function "connect_accept" in
dpm_orte.c takes an orte_process_name_t as the destination port. This
structure only contains the jobid and the vpid (always set to 0, I
guess meaning you plan to contact the HNP of that job). Obviously, if
the accepting process does not share the same HNP with the connecting
process, there is no way for the MPI_Comm_connect function to fill this
field correctly. The whole purpose of the port_name string is to
provide a consistent way to access the remote endpoint without a
complicated name resolution service. I think this function should take
the port_name instead (the string returned by open_port) and contact
this endpoint directly over the OOB to get the contact information it
needs from there, and not from the local HNP.


Aurelien

Le 4 avr. 08 à 15:21, Ralph H Castain a écrit :
Okay, I have a partial fix in there now. You'll have to use -mca  
routed

unity as I still need to fix it for routed tree.

Couple of things:

1. I fixed the --debug flag so it automatically turns on the debug  
output
from the data server code itself. Now ompi-server will tell you when  
it is

accessed.

2. remember, we added an MPI_Info key that specifies if you want the  
data
stored locally (on your own mpirun) or globally (on the ompi- 
server). If you
specify nothing, there is a precedence built into the code that  
defaults to
"local". So you have to tell us that this data is to be published  
"global"

if you want to connect multiple mpiruns.
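
As a concrete but hedged illustration of that key: the sketch below
publishes a port with an info object requesting global scope. The key
name "ompi_global_scope" is my recollection of what pubsub_orte.c looks
for and should be double-checked there; the service name is arbitrary.

#include <mpi.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);

    /* assumed key name - request storage on the ompi-server, not the
     * local mpirun, so a second mpirun can look the name up */
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "true");
    MPI_Publish_name("my_service", info, port);

    /* ... MPI_Comm_accept() the clients here ... */

    MPI_Unpublish_name("my_service", info, port);
    MPI_Info_free(&info);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}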

I believe Jeff wrote all that up somewhere - could be in an email  
thread,
though. Been too long ago for me to remember... ;-) You can look it  
up in

the code though as a last resort - it is in
ompi/mca/pubsub/orte/pubsub_orte.c.

Ralph



On 4/4/08 12:55 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

Well, something got borked in here - will have to fix it, so this  
will

probably not get done until next week.


On 4/4/08 12:26 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

Yeah, you didn't specify the file correctly...plus I found a bug  
in the code

when I looked (out-of-date a little in orterun).

I am updating orterun (commit soon) and will include a better help  
message
about the proper format of the orterun cmd-line option. The syntax  
is:


-ompi-server uri

or -ompi-server file:filename-where-uri-exists

Problem here is that you gave it a uri of "test", which means  
nothing. ;-)


Should have it up-and-going soon.
Ralph

On 4/4/08 12:02 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>  
wrote:



Ralph,

I've not been very successful at using ompi-server. I tried this :

xterm1$ ompi-server --debug-devel -d --report-uri test
[grosse-pomme.local:01097] proc_info: hnp_uri NULL
daemon uri NULL
[grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and  
running!



xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
Port name:
2285895681.0;tcp://192.168.0.101:50065;tcp:// 
192.168.0.150:50065:300


xterm3$ mpirun -ompi-server test  -np 1 simple_connect
--
Process rank 0 attempted to lookup from a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently ex

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-04 Thread Aurélien Bouteiller

Ralph,

I've not been very successful at using ompi-server. I tried this :

xterm1$ ompi-server --debug-devel -d --report-uri test
[grosse-pomme.local:01097] proc_info: hnp_uri NULL
daemon uri NULL
[grosse-pomme.local:01097] [[34900,0],0] ompi-server: up and running!


xterm2$ mpirun -ompi-server test -np 1 mpi_accept_test
Port name:
2285895681.0;tcp://192.168.0.101:50065;tcp://192.168.0.150:50065:300

xterm3$ mpirun -ompi-server test  -np 1 simple_connect
--
Process rank 0 attempted to lookup from a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.

--
[grosse-pomme.local:01122] *** An error occurred in MPI_Lookup_name
[grosse-pomme.local:01122] *** on communicator MPI_COMM_WORLD
[grosse-pomme.local:01122] *** MPI_ERR_NAME: invalid name argument
[grosse-pomme.local:01122] *** MPI_ERRORS_ARE_FATAL (goodbye)
--



The server code calls Open_port and then Publish_name. It looks like the
Lookup_name function cannot reach the ompi-server. The ompi-server in
debug mode does not show any output when a new event occurs (like when
the server is launched). Is there something wrong in the way I use it?


Aurelien

Le 3 avr. 08 à 17:21, Ralph Castain a écrit :
Take a gander at ompi/tools/ompi-server - I believe I put a man page  
in

there. You might just try "man ompi-server" and see if it shows up.

Holler if you have a question - not sure I documented it very  
thoroughly at

the time.


On 4/3/08 3:10 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>  
wrote:



Ralph,


I am using trunk. Is there documentation for ompi-server? Sounds
exactly like what I need to fix point 1.

Aurelien

Le 3 avr. 08 à 17:06, Ralph Castain a écrit :

I guess I'll have to ask the basic question: what version are you
using?

If you are talking about the trunk, there no longer is a "universe"
concept
anywhere in the code. Two mpiruns can connect/accept to each other
as long
as they can make contact. To facilitate that, we created an "ompi-
server"
tool that is supposed to be run by the sys-admin (or a user, doesn't
matter
which) on the head node - there are various ways to tell mpirun  
how to

contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to  
work. I

haven't spent much time testing connect/accept except via
comm_spawn, which
seems to be working. Since that uses the same mechanism, I would  
have

expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph



On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>
wrote:


Hi everyone,

I'm trying to figure out how complete is the implementation of
Comm_connect/Accept. I found two problematic cases.

1) Two different programs are started in two different mpirun. One
makes accept, the second one use connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port;  
especially

considering that in Open MPI, the port_name is a string containing
the
tcp address and port of the rank 0 in the server communicator.
However, doing so results in "no route to host" and the connecting
application aborts. Is the problem related to an explicit check of
the
universes on the accept HNP ? Do I expect too much from the MPI
standard ? Is it because my two applications does not share the  
same

universe ? Should we (re) add the ability to use the same universe
for
several mpirun ?

2) Second issue is when the program setup a port, and then accept
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever when waiting for the second
one. My understanding of the standard is that it should work: 5.4.2
states "it must call MPI_Open_port to establish a port [...] it  
must

call MPI_Comm_accept to accept connections from clients". I
understand
that for one MPI_Open_port I should be able to manage several MPI
clients. Am I understanding correctly the standard here and  
should we

fix this ?

Here is a copy of the non-working code for reference.

/*
* Copyright (c) 2004-2007 The Trustees of the University of
Tennessee.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char

Re: [OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Aurélien Bouteiller

Ralph,


I am using trunk. Is there documentation for ompi-server? Sounds
exactly like what I need to fix point 1.


Aurelien

Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
I guess I'll have to ask the basic question: what version are you  
using?


If you are talking about the trunk, there no longer is a "universe"  
concept
anywhere in the code. Two mpiruns can connect/accept to each other  
as long
as they can make contact. To facilitate that, we created an "ompi- 
server"
tool that is supposed to be run by the sys-admin (or a user, doesn't  
matter

which) on the head node - there are various ways to tell mpirun how to
contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to work. I
haven't spent much time testing connect/accept except via  
comm_spawn, which

seems to be working. Since that uses the same mechanism, I would have
expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph



On 4/3/08 2:29 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>  
wrote:



Hi everyone,

I'm trying to figure out how complete is the implementation of
Comm_connect/Accept. I found two problematic cases.

1) Two different programs are started in two different mpirun. One
makes accept, the second one use connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port; especially
considering that in Open MPI, the port_name is a string containing  
the

tcp address and port of the rank 0 in the server communicator.
However, doing so results in "no route to host" and the connecting
application aborts. Is the problem related to an explicit check of  
the

universes on the accept HNP ? Do I expect too much from the MPI
standard ? Is it because my two applications does not share the same
universe ? Should we (re) add the ability to use the same universe  
for

several mpirun ?

2) Second issue is when the program setup a port, and then accept
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever when waiting for the second
one. My understanding of the standard is that it should work: 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". I  
understand

that for one MPI_Open_port I should be able to manage several MPI
clients. Am I understanding correctly the standard here and should we
fix this ?

Here is a copy of the non-working code for reference.

/*
 * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    int rank;
    int np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if(rank)
    {
        MPI_Comm comm;
        /* client */
        MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Read port: %s\n", port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);

        MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
        MPI_Comm_disconnect(&comm);
    }
    else
    {
        /* server */
        int nc = np - 1;
        MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
        MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
        int *event = (int *) calloc(nc, sizeof(int));
        int i;

        MPI_Open_port(MPI_INFO_NULL, port);
        /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
        printf("Port name: %s\n", port);
        for(i = 1; i < np; i++)
            MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0, MPI_COMM_WORLD);

        for(i = 0; i < nc; i++)
        {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &comm_nodes[i]);
            printf("Accept %d\n", i);
            MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i], &reqs[i]);
            printf("IRecv %d\n", i);
        }
        MPI_Close_port(port);
        MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
        for(i = 0; i < nc; i++)
        {
            printf("event[%d] = %d\n", i, event[i]);
            MPI_Comm_disconnect(&comm_nodes[i]);
            printf("Disconnect %d\n", i);
        }
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}




--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321






[OMPI devel] MPI_Comm_connect/Accept

2008-04-03 Thread Aurélien Bouteiller

Hi everyone,

I'm trying to figure out how complete the implementation of
Comm_connect/Accept is. I found two problematic cases.


1) Two different programs are started by two different mpiruns. One
calls accept, the second one calls connect. I would not expect
MPI_Publish_name/Lookup_name to work because they do not share the
HNP. Still, I would expect to be able to connect by copying (with
printf-scanf) the port_name string generated by Open_port, especially
considering that in Open MPI the port_name is a string containing the
tcp address and port of rank 0 in the server communicator. However,
doing so results in "no route to host" and the connecting application
aborts. Is the problem related to an explicit check of the universes on
the accepting HNP? Do I expect too much from the MPI standard? Is it
because my two applications do not share the same universe? Should we
(re)add the ability to use the same universe for several mpiruns?


2) The second issue is when the program sets up a port and then accepts
multiple clients on this port. Everything works fine for the first
client, and then accept stalls forever while waiting for the second
one. My understanding of the standard is that it should work: 5.4.2
states "it must call MPI_Open_port to establish a port [...] it must
call MPI_Comm_accept to accept connections from clients". I understand
that with one MPI_Open_port I should be able to manage several MPI
clients. Am I understanding the standard correctly here, and should we
fix this?


Here is a copy of the non-working code for reference.

/*
 * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    int rank;
    int np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if(rank)
    {
        MPI_Comm comm;
        /* client */
        MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Read port: %s\n", port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);

        MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
        MPI_Comm_disconnect(&comm);
    }
    else
    {
        /* server */
        int nc = np - 1;
        MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
        MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
        int *event = (int *) calloc(nc, sizeof(int));
        int i;

        MPI_Open_port(MPI_INFO_NULL, port);
        /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
        printf("Port name: %s\n", port);
        for(i = 1; i < np; i++)
            MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0, MPI_COMM_WORLD);

        for(i = 0; i < nc; i++)
        {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &comm_nodes[i]);
            printf("Accept %d\n", i);
            MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i], &reqs[i]);
            printf("IRecv %d\n", i);
        }
        MPI_Close_port(port);
        MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
        for(i = 0; i < nc; i++)
        {
            printf("event[%d] = %d\n", i, event[i]);
            MPI_Comm_disconnect(&comm_nodes[i]);
            printf("Disconnect %d\n", i);
        }
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}




--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321







Re: [OMPI devel] Fault tolerance

2008-03-07 Thread Aurélien Bouteiller

We now use the errmgr.

Aurelien

Le 6 mars 08 à 13:38, Aurélien Bouteiller a écrit :


Aside from what Josh said, we are working right now at UTK on orted/MPI
recovery (without killing/respawning everything). For now we have had
no use for the errmgr, but I'm quite sure this would be the smartest
place to put all the mechanisms we are trying now.

Aurelien
Le 6 mars 08 à 11:17, Ralph Castain a écrit :


Ah - ok, thanks for clarifying! I'm happy to leave it around, but
wasn't
sure if/where it fit into anyone's future plans.

Thanks
Ralph



On 3/6/08 9:13 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:


The checkpoint/restart work that I have integrated does not respond to
failure at the moment. If a failure happens I want ORTE to terminate
the entire job. I will then restart the entire job from a checkpoint
file. This follows the 'all fall down' approach that users typically
expect when using a global C/R technique.

Eventually I want to integrate something better where I can respond
to
a failure with a recovery from inside ORTE. I'm not there yet, but
hopefully in the near future.

I'll let the UTK group talk about what they are doing with ORTE,
but I
suspect they will be taking advantage of the errmgr to help respond
to
failure and restart a single process.


It is important to consider in this context that we do *not* always
want ORTE to abort whenever it detects a process failure. This is the
default mode for MPI applications (MPI_ERRORS_ARE_FATAL), and should
be supported. But there is another mode (MPI_ERRORS_RETURN) in which we
would like ORTE to keep running to conform with the standard:
http://www.mpi-forum.org/docs/mpi-11-html/node148.html
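
For reference, a small hedged example of what that second mode looks like
from the application side (independent of any particular ORTE behavior):
with MPI_ERRORS_RETURN installed, a failure is supposed to surface as an
error code instead of aborting the job.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc;

    MPI_Init(&argc, &argv);
    /* The default is MPI_ERRORS_ARE_FATAL; switch to MPI_ERRORS_RETURN
     * so errors come back as return codes the program can inspect. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (MPI_SUCCESS != rc) {
        fprintf(stderr, "barrier returned error code %d, trying to continue\n", rc);
    }

    MPI_Finalize();
    return 0;
}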

It is known that certain standards-conformant MPI "fault tolerant"
programs do not work in Open MPI for various reasons, some in the
runtime and some external. Here we are mostly talking about the
disconnected fates of intra-communicator groups. I have a test in the
ompi-tests repository that illustrates this problem, but I do not have
time to fix it at the moment.


So in short keep the errmgr around for now. I suspect we will be
using
it, and possibly tweaking it in the nearish future.

Thanks for the observation.

Cheers,
Josh

On Mar 6, 2008, at 10:44 AM, Ralph Castain wrote:


Hello

I've been doing some work on fault response within the system, and
finally
realized something I should probably have seen awhile back. Perhaps
I am
misunderstanding somewhere, so forgive the ignorance if so.

When we designed ORTE some time in the deep, dark past, we had
envisioned
that people might want multiple ways of responding to process  
faults

and/or
abnormal terminations. You might want to just abort the job,  
attempt

to
restart just that proc, attempt to restart the job, etc. To support
these
multiple options, and to provide a means for people to simply try
new ones,
we created the errmgr framework.

Our thought was that a process and/or daemon would call the errmgr
when we
detected something abnormal happening, and that the selected errmgr
component could then do whatever fault response was desired.

However, I now see that the fault tolerance mechanisms inside of
OMPI do not
seem to be using that methodology. Instead, we have hard-coded a
particular
response into the system.

If we configure without FT, we just abort the entire job since that
is the
only errmgr component that exists.

If we configure with FT, then we execute the hard-coded C/R
methodology.
This is built directly into the code, so there is no option as to
what
happens.

Is there a reason why the errmgr framework was not used? Did the FT
team
decide that this was not a useful tool to support multiple FT
strategies?
Can we modify it to better serve those needs, or is it simply not
feasible?

If it isn't going to be used for that purpose, then I might as well
remove
it. As things stand, there really is no purpose served by the  
errmgr

framework - might as well replace it with just a function call.

Appreciate any insights
Ralph















Re: [OMPI devel] OMPI and Mac Leopard

2008-02-23 Thread Aurélien Bouteiller
Trunk works fine on Leopard in both static and dso builds. Didn't try
the tmp branch on Leopard, though.



Aurelien
Le 22 févr. 08 à 23:17, Ralph Castain a écrit :

I have confirmed that my tmp branch now builds and works on the Mac  
Leopard
OS, at least on an Intel arch. It is really critical, however, that  
you
don't try to build statically on that system (trust me - hard  
experience).


I believe the trunk and older versions are having some problems under
Leopard. I haven't fully confirmed that, though I did see some strange
behavior on my test machine here, so it may not be entirely accurate.

I am waiting for just a couple of checks to be completed before  
merging the
branch to the trunk. Hopefully, the appropriate people will have a  
chance to
finish those checks over the next few days so we can do the merge  
next week.


Will keep you posted.
Ralph







[OMPI devel] PML V will be enabled again

2008-02-08 Thread Aurélien Bouteiller

Hi everyone,

All the problems detected the last time PML V was enabled in the trunk
have been fixed. We invite you to give it a try (add a .ompi_unignore
in ompi/mca/pml/v) on your favorite platform and compilation options
and report any issues you may encounter. If none are detected, we plan
to remove the ignore tag on Wed., Feb. 6.


Thanks,
Aurelien


--
Dr. Aurélien Bouteiller
Sr. Research Associate - Innovative Computing Laboratory
Suite 350, 1122 Volunteer Boulevard
Knoxville, TN 37996
865 974 6321







Re: [OMPI devel] orte_ns_base_select failed: returned value -1 instead of ORTE_SUCCESS

2008-01-31 Thread Aurélien Bouteiller
I tried using a fresh trunk; the same problem occurred. Here is the
complete configure line. I am using libtool 1.5.22 from fink.
Otherwise everything is standard OS 10.5.


  $ ../trunk/configure --prefix=/Users/bouteill/ompi/build --enable- 
mpirun-prefix-by-default --disable-io-romio --enable-debug --enable- 
picky --enable-mem-debug --enable-mem-profile --enable-visibility -- 
disable-dlopen --disable-shared --enable-static


The error message generated by abort contains garbage (line numbers do
not match anything in the .c files, and according to gdb the failure
does not occur during ns initialization). This looks like heap
corruption or something equally bad.


orterun (argc=4, argv=0xb81c) at ../../../../trunk/orte/tools/ 
orterun/orterun.c:529
529	cb_states = ORTE_PROC_STATE_TERMINATED |  
ORTE_PROC_STATE_AT_STG1;

(gdb) n
530	rc = orte_rmgr.spawn_job(apps, num_apps, , 0, NULL,  
job_state_callback, cb_states, );

(gdb) n
531	while (NULL != (item = opal_list_remove_first()))  
OBJ_RELEASE(item);

(gdb) n
** Stepping over inlined function code. **
532 OBJ_DESTRUCT();
(gdb) n
534 if (orterun_globals.do_not_launch) {
(gdb) n
539 OPAL_THREAD_LOCK(_globals.lock);
(gdb) n
541 if (ORTE_SUCCESS == rc) {
(gdb) n
542 while (!orterun_globals.exit) {
(gdb) n
543 opal_condition_wait(_globals.cond,
(gdb) n
[grosse-pomme.local:77335] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in  
file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/ 
oob_base_init.c at line 74


Aurelien


Le 30 janv. 08 à 17:18, Ralph Castain a écrit :


Are you running on the trunk, or an earlier release?

If the trunk, then I suspect you have a stale library hanging  
around. I

build and run statically on Leopard regularly.


On 1/30/08 2:54 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>  
wrote:



I get a runtime error in static build on Mac OS 10.5 (automake 1.10,
autoconf 2.60, gcc-apple-darwin 4.01, libtool 1.5.22).

The error does not occur in dso builds, and everything seems to work
fine on Linux.

Here is the error log.

~/ompi$ mpirun -np 2 NetPIPE_3.6/NPmpi
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/oob_base_init.c at line 74
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ns/proxy/ns_proxy_component.c at line 222
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Error in file /SourceCache/openmpi/openmpi-5/openmpi/orte/runtime/orte_init_stage1.c at line 230
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ns_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init_stage1 failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)



--
Dr. Aurélien Bouteiller
Sr. Research Associate - Innovative Computing Laboratory
Suite 350, 1122 Volunteer Boulevard
Knoxville, TN 37996
865 974 6321














[OMPI devel] orte_ns_base_select failed: returned value -1 instead of ORTE_SUCCESS

2008-01-30 Thread Aurélien Bouteiller
I get a runtime error in a static build on Mac OS 10.5 (automake 1.10,
autoconf 2.60, gcc-apple-darwin 4.01, libtool 1.5.22).


The error does not occur in dso builds, and everything seems to work  
fine on Linux.


Here is the error log.

~/ompi$ mpirun -np 2 NetPIPE_3.6/NPmpi
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/oob/base/oob_base_init.c at line 74
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Bad parameter in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ns/proxy/ns_proxy_component.c at line 222
[grosse-pomme.local:34247] [NO-NAME] ORTE_ERROR_LOG: Error in file /SourceCache/openmpi/openmpi-5/openmpi/orte/runtime/orte_init_stage1.c at line 230

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ns_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init_stage1 failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)



--
Dr. Aurélien Bouteiller
Sr. Research Associate - Innovative Computing Laboratory
Suite 350, 1122 Volunteer Boulevard
Knoxville, TN 37996
865 974 6321







Re: [OMPI devel] RES: v pml question

2008-01-29 Thread Aurélien Bouteiller
I agree with Josh. We thought about it a bit, and nothing should
prevent using both together.


Aurelien
On Jan 29, 2008, at 15:01, Josh Hursey wrote:


At the moment I do not plan on joining the crcpw and v_protocol.

However those two components may currently work just fine together.
They are both designed to wrap around whatever the 'selected' PML
happens to be. If you tried to do this, I would expect the PML call
stack to look something like the following:
PML_SEND -> v_protocol -> crcpw -> ob1/cm
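
To illustrate the wrapping idea, here is a self-contained sketch of the pattern only (not Open MPI's real PML interface; the module type and function names below are made up): each wrapper keeps a pointer to the module it wraps and forwards the call after doing its own bookkeeping, so the layers can stack.

#include <stdio.h>

/* Sketch only: a minimal "PML-like" module with a send entry point and a
 * pointer to the module it wraps. */
typedef struct pml_module {
    int (*send)(struct pml_module *self, const void *buf, int count);
    struct pml_module *wrapped;   /* next module down the stack */
} pml_module_t;

static int ob1_send(pml_module_t *self, const void *buf, int count)
{
    (void)self; (void)buf;
    printf("ob1: sending %d element(s)\n", count);
    return 0;
}

static int crcpw_send(pml_module_t *self, const void *buf, int count)
{
    printf("crcpw: checkpoint coordination bookkeeping\n");
    return self->wrapped->send(self->wrapped, buf, count);
}

static int vprotocol_send(pml_module_t *self, const void *buf, int count)
{
    printf("v_protocol: message-logging bookkeeping\n");
    return self->wrapped->send(self->wrapped, buf, count);
}

int main(void)
{
    pml_module_t ob1  = { ob1_send, NULL };
    pml_module_t crcp = { crcpw_send, &ob1 };
    pml_module_t v    = { vprotocol_send, &crcp };
    int payload = 42;
    /* PML_SEND -> v_protocol -> crcpw -> ob1 */
    return v.send(&v, &payload, 1);
}

In the real code the wrapping order is decided when the components are selected, which is why the stacking order above is only what is expected here, not a guarantee.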

But since I have not tried this out I cannot say for sure. Let us know
if you have any problems.

Cheers,
Josh

On Jan 23, 2008, at 4:55 PM, Leonardo Fialho wrote:


I'm testing the v protocol just now. Anybody have plans to do a message
wrapper mixing crcpw and v_protocol?

Leonardo Fialho
University Autonoma of Barcelona









Re: [OMPI devel] Trunk borked

2008-01-29 Thread Aurélien Bouteiller

The DSO build also fails.

../../../../../../trunk/ompi/contrib/vt/vt/vtlib/vt_comp_gnu.c:312:5: warning: "VT_BFD" is not defined
../../../../../../trunk/ompi/contrib/vt/vt/vtlib/vt_comp_gnu.c:312:5: warning: "VT_BFD" is not defined

/usr/bin/ld: cannot find -lz
collect2: ld returned 1 exit status
make[6]: *** [vtfilter] Error 1

On Jan 29, 2008, at 01:51, George Bosilca wrote:

Looks like VT does not correctly compute its dependencies. A static build
will fail if libz.a is not installed on the system.


/usr/bin/ld: cannot find -lz
collect2: ld returned 1 exit status
make[5]: *** [vtfilter] Error 1

 george.

On Jan 28, 2008, at 12:37 PM, Matthias Jurenz wrote:


Hello,

this problem should be fixed now...
It seems that the symbol '__pos' is not available on every platform. This
isn't a problem, because it's only used for a debug control message.
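
For reference, the usual way to keep that kind of debug message portable is to guard the glibc-specific fpos_t member; a hypothetical sketch (the actual fix in vt_iowrap.c may differ, and debug_fsetpos is an invented helper name):

#include <stdio.h>

/* Only peek inside fpos_t where the glibc-specific __pos member exists;
 * on other platforms, fall back to a generic message. */
static void debug_fsetpos(const fpos_t *pos)
{
#if defined(__GLIBC__)
    fprintf(stderr, "fsetpos to offset %lld\n", (long long)pos->__pos);
#else
    (void)pos;
    fprintf(stderr, "fsetpos called\n");
#endif
}

int main(void)
{
    fpos_t pos;
    FILE *f = tmpfile();
    if (f != NULL && fgetpos(f, &pos) == 0)
        debug_fsetpos(&pos);
    if (f != NULL)
        fclose(f);
    return 0;
}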

Regards,
Matthias


On Mon, 2008-01-28 at 09:41 -0500, Jeff Squyres wrote:


Doh - this is Solaris on x86?  I think Terry said Solaris/sparc was
tested...

VT guys -- can you check out what's going on?



On Jan 28, 2008, at 9:36 AM, Adrian Knoth wrote:

> On Mon, Jan 28, 2008 at 07:26:56AM -0700, Ralph H Castain wrote:
>
>> We seem to have a problem on the trunk this morning. I am building
>> on a
>
> There are more errors:
>
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function
> `fsetpos':
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:850: error: request
> for member `__pos' in something not a structure or union
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function
> `fsetpos64':
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:876: error: request
> for member `__pos' in something not a structure or union
> gmake[5]: *** [vt_iowrap.o] Error 1
> gmake[5]: Leaving directory
> `/tmp/ompi/build/SunOS-i86pc/ompi/ompi/contrib/vt/vt/vtlib'
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function
> `fsetpos':
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:850: error: request
> for member `__pos' in something not a structure or union
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c: In function
> `fsetpos64':
> /tmp/ompi/src/ompi/contrib/vt/vt/vtlib/vt_iowrap.c:876: error: request
> for member `__pos' in something not a structure or union
> gmake[5]: *** [vt_iowrap.o] Error 1
> gmake[5]: Leaving directory
> `/tmp/ompi/build/SunOS-i86pc/ompi/ompi/contrib/vt/vt/vtlib'
>
>
> Just my $0.02
>
> --
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
>
> private: http://adi.thur.de



--
Matthias Jurenz,
Center for Information Services and
High Performance Computing (ZIH), TU Dresden,
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773







Re: [OMPI devel] Fwd: === CREATE FAILURE ===

2008-01-24 Thread Aurélien Bouteiller
According to POSIX, tar should not limit the file name length. Only the
v7 implementation of tar is limited to 99 characters. GNU tar has never
been limited in the number of characters a file name can have. You should
check with tar --help that the tar on your machine defaults to format=gnu
or format=posix. If it defaults to format=v7, I am curious why. Are you
using Solaris?


Aurelien

On Jan 24, 2008, at 15:18, Jeff Squyres wrote:


I'm trying to replicate and getting a lot of these:

tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/pessimist/vprotocol_pessimist_sender_based.c: file name is too long (max 99); not dumped
tar: openmpi-1.3a1r17212M/ompi/mca/pml/v/vprotocol/mca/vprotocol/pessimist/vprotocol_pessimist_component.c: file name is too long (max 99); not dumped

I'll bet that this is the real problem.  GNU tar on Linux defaults to
99 characters max, and the _component.c filename is 102, for example.

Can you shorten your names?


On Jan 24, 2008, at 3:02 PM, George Bosilca wrote:


We cannot reproduce this one. A simple "make distcheck" exits long
before doing anything in the ompi directory. It is difficult to see
where exactly it fails, but it is somewhere in the opal directory. I
suspect the new carto framework ...

Thanks,
  george.

On Jan 24, 2008, at 7:12 AM, Jeff Squyres wrote:


Aurelien --

Can you fix please?  Last night's tests didn't run because of this
failure.


Begin forwarded message:


From: MPI Team 
Date: January 23, 2008 9:13:30 PM EST
To: test...@open-mpi.org
Subject: === CREATE FAILURE ===
Reply-To: de...@open-mpi.org


ERROR: Command returned a non-zero exist status
   make -j 4 distcheck

Start time: Wed Jan 23 21:00:08 EST 2008
End time:   Wed Jan 23 21:13:30 EST 2008

===========================================================================

[... previous lines snipped ...]
config.status: creating orte/mca/snapc/Makefile
config.status: creating orte/mca/snapc/full/Makefile
config.status: creating ompi/mca/allocator/Makefile
config.status: creating ompi/mca/allocator/basic/Makefile
config.status: creating ompi/mca/allocator/bucket/Makefile
config.status: creating ompi/mca/bml/Makefile
config.status: creating ompi/mca/bml/r2/Makefile
config.status: creating ompi/mca/btl/Makefile
config.status: creating ompi/mca/btl/gm/Makefile
config.status: creating ompi/mca/btl/mx/Makefile
config.status: creating ompi/mca/btl/ofud/Makefile
config.status: creating ompi/mca/btl/openib/Makefile
config.status: creating ompi/mca/btl/portals/Makefile
config.status: creating ompi/mca/btl/sctp/Makefile
config.status: creating ompi/mca/btl/self/Makefile
config.status: creating ompi/mca/btl/sm/Makefile
config.status: creating ompi/mca/btl/tcp/Makefile
config.status: creating ompi/mca/btl/udapl/Makefile
config.status: creating ompi/mca/coll/Makefile
config.status: creating ompi/mca/coll/basic/Makefile
config.status: creating ompi/mca/coll/inter/Makefile
config.status: creating ompi/mca/coll/self/Makefile
config.status: creating ompi/mca/coll/sm/Makefile
config.status: creating ompi/mca/coll/tuned/Makefile
config.status: creating ompi/mca/common/Makefile
config.status: creating ompi/mca/common/mx/Makefile
config.status: creating ompi/mca/common/portals/Makefile
config.status: creating ompi/mca/common/sm/Makefile
config.status: creating ompi/mca/crcp/Makefile
config.status: creating ompi/mca/crcp/coord/Makefile
config.status: creating ompi/mca/io/Makefile
config.status: creating ompi/mca/io/romio/Makefile
config.status: creating ompi/mca/mpool/Makefile
config.status: creating ompi/mca/mpool/rdma/Makefile
config.status: creating ompi/mca/mpool/sm/Makefile
config.status: creating ompi/mca/mtl/Makefile
config.status: creating ompi/mca/mtl/mx/Makefile
config.status: creating ompi/mca/mtl/portals/Makefile
config.status: creating ompi/mca/mtl/psm/Makefile
config.status: creating ompi/mca/osc/Makefile
config.status: creating ompi/mca/osc/pt2pt/Makefile
config.status: creating ompi/mca/osc/rdma/Makefile
config.status: creating ompi/mca/pml/Makefile
config.status: creating ompi/mca/pml/cm/Makefile
config.status: creating ompi/mca/pml/crcpw/Makefile
config.status: creating ompi/mca/pml/dr/Makefile
config.status: creating ompi/mca/pml/ob1/Makefile
config.status: creating ompi/mca/pml/v/vprotocol/Makefile
config.status: error: cannot find input file: ompi/mca/pml/v/vprotocol/pessimist/Makefile.in
make: *** [distcheck] Error 1
===========================================================================


Your friendly daemon,
Cyrador



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] RES: v pml question

2008-01-23 Thread Aurélien Bouteiller

Hi,

Actually it might already work. We have never tried it, but nothing should
prevent it.


The symlinks are necessary to trick the autogen and configure stages.
This is required to avoid code replication from autogen.sh. If you
look carefully you will see that the symlinks are created only inside
the build directory, not in the source directory, so adding them to the
trunk would not help.


Aurelien

On Jan 23, 2008, at 16:55, Leonardo Fialho wrote:

I'm testing the v protocol just now. Anybody have plans to do a message
wrapper mixing crcpw and v_protocol?

Leonardo Fialho
University Autonoma of Barcelona

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On behalf of Jeff Squyres
Sent: Wednesday, January 23, 2008, 22:45
To: Open Developers
Subject: [OMPI devel] v pml question

Just curious: what are the "mca" and "vprotocol" symlinks to "." in the
v/vprotocol directory for?

If they're necessary, can they be committed to svn?  If they're not
necessary, can they be removed?

--
Jeff Squyres
Cisco Systems








Re: [OMPI devel] trunk breakage

2008-01-23 Thread Aurélien Bouteiller

Should be fixed with r17184. Thanks for the quick bug report!

Aurelien

On Jan 23, 2008, at 14:08, Jeff Squyres wrote:


The vprotocol pml does not compile for me.


make[4]: Entering directory `/home/jsquyres/svn/ompi2/ompi/mca/pml/v/vprotocol/pessimist'
/bin/sh ../../../../../../libtool --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../../../../opal/include -I../../../../../../orte/include -I../../../../../../ompi/include -I../../../../../../opal/mca/paffinity/linux/plpa/src/libplpa   -I../../../../../.. -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread -MT mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.lo -MD -MP -MF .deps/mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.Tpo -c -o mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.lo `test -f 'vprotocol_pessimist_sender_based.c' || echo './'`vprotocol_pessimist_sender_based.c
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../../../opal/include -I../../../../../../orte/include -I../../../../../../ompi/include -I../../../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../../../.. -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread -MT mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.lo -MD -MP -MF .deps/mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.Tpo -c vprotocol_pessimist_sender_based.c  -fPIC -DPIC -o .libs/mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.o
vprotocol_pessimist_sender_based.c: In function `sb_mmap_alloc':
vprotocol_pessimist_sender_based.c:94: error: `MAP_NOCACHE' undeclared (first use in this function)
vprotocol_pessimist_sender_based.c:94: error: (Each undeclared identifier is reported only once
vprotocol_pessimist_sender_based.c:94: error: for each function it appears in.)
make[4]: *** [mca_vprotocol_pessimist_la-vprotocol_pessimist_sender_based.lo] Error 1
make[4]: Leaving directory `/home/jsquyres/svn/ompi2/ompi/mca/pml/v/vprotocol/pessimist'
make[3]: *** [all-recursive] Error 1
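
For context, MAP_NOCACHE in the error above is a Darwin-only mmap(2) flag. The usual portability guard for such a flag looks like the sketch below (sb_mmap_example is an invented name; this is not necessarily the exact change made in r17184):

#include <stddef.h>
#include <sys/mman.h>

/* MAP_NOCACHE exists on Darwin but not on Linux or Solaris; defining it
 * to 0 where it is missing lets the mmap() call stay unconditional. */
#ifndef MAP_NOCACHE
#define MAP_NOCACHE 0
#endif

void *sb_mmap_example(int fd, size_t length)
{
    return mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOCACHE, fd, 0);
}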


On Jan 23, 2008, at 12:27 PM, boute...@osl.iu.edu wrote:


Author: bouteill
Date: 2008-01-23 12:27:23 EST (Wed, 23 Jan 2008)
New Revision: 17182
URL: https://svn.open-mpi.org/trac/ompi/changeset/17182

Log:
removed ignore, as the code is robust enough to avoid interfering
with others
Removed:
 trunk/ompi/mca/pml/v/.ompi_ignore
 trunk/ompi/mca/pml/v/.ompi_unignore

Deleted: trunk/ompi/mca/pml/v/.ompi_ignore
==============================================================================


Deleted: trunk/ompi/mca/pml/v/.ompi_unignore
==============================================================================

--- trunk/ompi/mca/pml/v/.ompi_unignore 2008-01-23 12:27:23 EST (Wed, 23 Jan 2008)
+++ (empty file)
@@ -1 +0,0 @@
-bouteill



--
Jeff Squyres
Cisco Systems






Re: [OMPI devel] [OMPI svn] svn:open-mpi r17177

2008-01-23 Thread Aurélien Bouteiller

Undefined symbols:
  "_opal_carto_base_components_opened", referenced from:
  _opal_carto_base_components_opened$non_lazy_ptr in components.o
  "_opal_carto_base_open", referenced from:
  ompi_info::open_components()  in components.o
  "_opal_carto_base_close", referenced from:
  ompi_info::close_components()  in components.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
make[3]: *** [ompi_info] Error 1

I think you forgot one file in Makefile.am ;)

Aurelien

On Jan 23, 2008, at 04:20, shar...@osl.iu.edu wrote:


Author: sharonm
Date: 2008-01-23 04:20:34 EST (Wed, 23 Jan 2008)
New Revision: 17177
URL: https://svn.open-mpi.org/trac/ompi/changeset/17177

Log:
Move the carto framework to the trunk.

Added:
  trunk/opal/class/opal_graph.c   (contents, props changed)
  trunk/opal/class/opal_graph.h   (contents, props changed)
  trunk/opal/mca/carto/
  trunk/opal/mca/carto/Makefile.am   (contents, props changed)
  trunk/opal/mca/carto/auto_detect/
  trunk/opal/mca/carto/auto_detect/Makefile.am
  trunk/opal/mca/carto/auto_detect/carto_auto_detect.h
  trunk/opal/mca/carto/auto_detect/carto_auto_detect_component.c
  trunk/opal/mca/carto/auto_detect/carto_auto_detect_module.c
  trunk/opal/mca/carto/auto_detect/configure.params
  trunk/opal/mca/carto/base/
  trunk/opal/mca/carto/base/Makefile.am   (contents, props changed)
  trunk/opal/mca/carto/base/base.h   (contents, props changed)
  trunk/opal/mca/carto/base/carto_base_close.c   (contents, props changed)
  trunk/opal/mca/carto/base/carto_base_graph.c
  trunk/opal/mca/carto/base/carto_base_graph.h
  trunk/opal/mca/carto/base/carto_base_open.c   (contents, props changed)
  trunk/opal/mca/carto/base/carto_base_select.c   (contents, props changed)
  trunk/opal/mca/carto/base/static-components.h   (contents, props changed)
  trunk/opal/mca/carto/carto.h   (contents, props changed)
  trunk/opal/mca/carto/file/
  trunk/opal/mca/carto/file/Makefile.am   (contents, props changed)
  trunk/opal/mca/carto/file/carto_file.h   (contents, props changed)
  trunk/opal/mca/carto/file/carto_file_component.c   (contents, props changed)
  trunk/opal/mca/carto/file/carto_file_lex.c
  trunk/opal/mca/carto/file/carto_file_lex.h
  trunk/opal/mca/carto/file/carto_file_lex.l
  trunk/opal/mca/carto/file/carto_file_module.c   (contents, props changed)
  trunk/opal/mca/carto/file/configure.params   (contents, props changed)
  trunk/opal/mca/carto/file/help-opal-carto-file.txt
  trunk/test/carto/
  trunk/test/carto/carto-file
  trunk/test/carto/carto_test.c
Text files modified:
  trunk/ompi/runtime/ompi_mpi_finalize.c   | 3 +++
  trunk/ompi/runtime/ompi_mpi_init.c   |11 +++
  trunk/ompi/tools/ompi_info/components.cc | 6 ++
  trunk/ompi/tools/ompi_info/ompi_info.cc  | 1 +
  trunk/opal/class/Makefile.am | 2 ++
  trunk/orte/tools/orterun/orterun.c   | 3 +++
  6 files changed, 26 insertions(+), 0 deletions(-)


Diff not shown due to size (183702 bytes).
To see the diff, run the following command:

svn diff -r 17176:17177 --no-diff-deleted
