[OMPI users] Memchecker and Wait

2009-08-11 Thread Allen Barnett
Hi:
I'm trying to use the memchecker/valgrind capability of OpenMPI 1.3.3 to
help debug my MPI application. I noticed a rather odd thing: After
Waiting on a Recv Request, valgrind declares my receive buffer as
invalid memory. Is this just a fluke of valgrind, or is OMPI doing
something internally?

This is on a 64-bit RHEL 5 system using GCC 4.3.2 and Valgrind 3.4.1.

Here is an example:
--
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if ( size != 2 ) {
    if ( rank == 0 )
      printf("Please run with 2 processes.\n");
    MPI_Finalize();
    return 1;
  }

  if (rank == 0) {
    char buffer_in[100];
    MPI_Request req_in;
    MPI_Status status;
    memset( buffer_in, 1, sizeof(buffer_in) );
    MPI_Recv_init( buffer_in, 100, MPI_CHAR, 1, 123, MPI_COMM_WORLD,
                   &req_in );
    MPI_Start( &req_in );
    printf( "Before wait: %p: %d\n", buffer_in, buffer_in[3] );
    printf( "Before wait: %p: %d\n", buffer_in, buffer_in[4] );
    MPI_Wait( &req_in, &status );
    printf( "After wait: %p: %d\n", buffer_in, buffer_in[3] );
    printf( "After wait: %p: %d\n", buffer_in, buffer_in[4] );
    MPI_Request_free( &req_in );
  }
  else {
    char buffer_out[100];
    memset( buffer_out, 2, sizeof(buffer_out) );
    MPI_Send( buffer_out, 100, MPI_CHAR, 0, 123, MPI_COMM_WORLD );
  }

  MPI_Finalize();
  return 0;
}
--

Doing "mpirun -np 2 -mca btl ^sm valgrind ./a.out" yields:

Before wait: 0x7ff0003b0: 1
Before wait: 0x7ff0003b0: 1
==15487== 
==15487== Invalid read of size 1
==15487==at 0x400C6B: main (waittest.c:30)
==15487==  Address 0x7ff0003b3 is on thread 1's stack
After wait: 0x7ff0003b0: 2
==15487== 
==15487== Invalid read of size 1
==15487==at 0x400C8B: main (waittest.c:31)
==15487==  Address 0x7ff0003b4 is on thread 1's stack
After wait: 0x7ff0003b0: 2

Also, if I run this program with the shared memory BTL active, valgrind
reports several "conditional jump or move depends on uninitialized
value"s in the SM BTL and about 24k lost bytes at the end (mostly from
allocations in MPI_Init).
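
For what it's worth, one thing I may try next is valgrind's memcheck
client-request macros, to see whether the buffer really is left marked
undefined after the Wait and to clear the mark as a stopgap. A minimal
sketch (assuming valgrind's memcheck.h is installed), dropped in right
after the MPI_Wait call in the example above:

--
/* Sketch only: query and reset memcheck's definedness tracking for the
   completed receive.  Requires <valgrind/memcheck.h>. */
#include <valgrind/memcheck.h>

    MPI_Wait( &req_in, &status );
    /* Prints a memcheck error (with stack trace) if any byte of the
       buffer is still considered undefined or unaddressable. */
    VALGRIND_CHECK_MEM_IS_DEFINED( buffer_in, sizeof(buffer_in) );
    /* Stopgap: tell memcheck the received bytes are now valid data. */
    VALGRIND_MAKE_MEM_DEFINED( buffer_in, sizeof(buffer_in) );
--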

Thanks,
Allen

-- 
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
Skype:  allenbarnett



Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Gus Correa

Hi Jody

Jody Klymak wrote:


On Aug 11, 2009, at  17:35 PM, Gus Correa wrote:

You can check this, say, by logging in to each node and doing 
/usr/local/openmpi/bin/ompi_info and comparing the output.



Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009.



Did you wipe off the old directories before reinstalling?
I had bad surprises by just running make again.
It is safer to cleanup, run configure, run make, run make install
all over again.

I prefer to install on a NFS mounted directory,
and set the user environment (PATH, MANPATH, LD_LIBRARY_PATH, etc)
to search that directory before it looks for standard ones (such as 
/usr/local).

This ensures consistency on all nodes with a single installation,
no need to install on all nodes.
For clusters with a modest number of nodes this scales fine.
On different clusters I have used names such as /home/software,
/share/apps (Rocks cluster), etc,
for the main NFS mounted directory that
holds MPI and other applications,
and lives on the head node or on a storage node.
A lot of people do this.

Another thing to look at is what is in your .bashrc/.tcshrc file,
whether it doesn't contain anything that may point to a different 
OpenMPI, modify the PATH mistakenly, etc.

I don't know about Mac OS-X, but in Linux the files in /etc/profile.d
often also set the user environment, and if they're wrong,
they can produce funny results.
Do you have any MPI related files there?

What about passwords?  ssh from server to node is passwordless, but do 
the nodes need to be passwordless as well?  i.e. is xserve01 trying to 
ssh to xserve02?




I would say so.
At least that is what we have on three Linux clusters.
passwordless ssh across any pair of nodes.
However, I would guess if this were not working,
other MPI versions wouldn't work either.

In any case:

Have you tried to ssh from node to node on all possible pairs?

Do you have the public RSA key for all nodes on 
/etc/ssh/ssh_known_hosts2 (on all nodes)?
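
At the MPI level, a stripped-down all-pairs ping test, in the spirit of
the connectivity_c example mentioned elsewhere in this thread, can also
help isolate which pair of nodes fails. A sketch of my own (not the
example shipped with OpenMPI):

--
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, size, i, j, token;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* For every pair (i, j) with i < j: rank i pings j, rank j echoes. */
  for (i = 0; i < size; i++) {
    for (j = i + 1; j < size; j++) {
      if (rank == i) {
        MPI_Send(&rank, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD, &status);
        printf("rank %d <-> rank %d ok\n", i, j);
      } else if (rank == j) {
        MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&rank, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
      }
    }
  }

  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0)
    printf("Connectivity test on %d processes PASSED.\n", size);
  MPI_Finalize();
  return 0;
}
--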


Anyway, not sure what else I can do to debug this. I'm considering 
rolling back to 1.1.5 and living without a queue manager...




How could you roll back to 1.1.5,
now that you overwrote the directories?

Hang in there!
The problem can be sorted out.

Launching jobs with Torque is much better than
using bare-bones mpirun.
You can queue up a sequence of MITgcm runs,
say one year each, each job pending on the correct
completion of the previous job, and just watch the results come out.
This and other features of resource managers
are very convenient, and you don't want to miss that.
If there is more than one user, then a resource manager is a must.
And you don't want to stay behind with the OpenMPI versions
and improvements either.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Jody Klymak


On Aug 11, 2009, at  17:35 PM, Gus Correa wrote:

You can check this, say, by logging in to each node and doing /usr/ 
local/openmpi/bin/ompi_info and comparing the output.



Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009.

What about passwords?  ssh from server to node is passwordless, but do  
the nodes need to be passwordless as well?  i.e. is xserve01 trying to  
ssh to xserve02?


Anyway, not sure what else I can do to debug this. I'm considering  
rolling back to 1.1.5 and living without a queue manager...



Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/






Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Gus Correa

Hi Jody

Are you sure you have the same OpenMPI version installed on 
/usr/local/openmpi on *all* nodes?


The fact that the programs run on xserver0 alone, but hang
when you try xserver0 and xserver1 together, suggests
some inconsistency in the runtime environment,
which may come from different OpenMPI versions.

You can check this, say, by logging in to each node and doing 
/usr/local/openmpi/bin/ompi_info and comparing the output.


Anyway, this is just a guess.

Gus Correa


Jody Klymak wrote:

Hello,


On Aug 11, 2009, at  8:15 AM, Ralph Castain wrote:

You can turn off those mca params I gave you as you are now past that 
point. I know there are others that can help debug that TCP btl error, 
but they can help you there.


Just to eliminate the mitgcm from the debugging I compiled 
example/hello_c.c and run as:


 /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01 
hello_c >& hello_c4_1host.txt


There is no ostensible problem.  If I run as:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host 
xserve01,xserve02 hello_c >& hello_c4_2host.txt


The process says Hello, but hangs at the end, and needs to be killed 
with ^C.


I then modified connectivity_c to include a printf as MPI is 
initialized, and hardwired verbose=1.  This completes, and appears to 
work fine..


/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01 
connectivity_c >& connectivity_c8_1host.txt


However, again, two hosts sours the mix:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host 
xserve01,xserve02 connectivity_c >& connectivity_c8_2host.txt


This hangs, and after waiting a minute or so we see that rank 0--4 on 
xserve01 cannot contact rank 5 (presumably on xserve02).  

It seems that I have something wrong in my tcp setup, but communication 
between these servers worked yesterday using 1.1.5, and ping etc all 
work fine, so something else is up.  Some sort of port permissions?  


The most glaring error I see in these is:

[xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay - recipient 
list is empty!


I see reference in the archives to a similar error where "contacts.txt" 
could not be found.  I've had trouble with 10.5.7 with temporary 
directories, so maybe that is the issue?


Thanks Jody















--
Jody Klymak
http://web.uvic.ca/~jklymak/








___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Ralph Castain

I can't speak to the tcp problem, but the following:

[xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay -  
recipient list is empty!


is not an error message. It is perfectly normal operation.

Ralph


On Aug 11, 2009, at 1:54 PM, Jody Klymak wrote:


Hello,


On Aug 11, 2009, at  8:15 AM, Ralph Castain wrote:

You can turn off those mca params I gave you as you are now past  
that point. I know there are others that can help debug that TCP  
btl error, but they can help you there.


Just to eliminate the mitgcm from the debugging I compiled example/ 
hello_c.c and run as:


 /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01  
hello_c >& hello_c4_1host.txt


There is no ostensible problem.  If I run as:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host  
xserve01,xserve02 hello_c >& hello_c4_2host.txt


The process says Hello, but hangs at the end, and needs to be killed  
with ^C.


I then modified connectivity_c to include a printf as MPI is  
initialized, and hardwired verbose=1.  This completes, and appears  
to work fine..


/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01  
connectivity_c >& connectivity_c8_1host.txt


However, again, two hosts sours the mix:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host  
xserve01,xserve02 connectivity_c >& connectivity_c8_2host.txt


This hangs, and after waiting a minute or so we see that rank 0--4  
on xserve01 cannot contact rank 5 (presumably on xserve02).


It seems that I have something wrong in my tcp setup, but  
communication between these servers worked yesterday using 1.1.5,  
and ping etc all work fine, so something else is up.  Some sort of  
port permissions?


The most glaring error I see in these is:

[xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay -  
recipient list is empty!


I see reference in the archives to a similar error where  
"contacts.txt" could not be found.  I've had trouble with 10.5.7  
with temporary directories, so maybe that is the issue?


Thanks Jody







--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Tuned collectives: How to choose them dynamically? (-mca coll_tuned_dynamic_rules_filename dyn_rules)"

2009-08-11 Thread Gus Correa

Hi Pavel, Lenny, Igor, list

Igor: Thanks for the pointer to your notes/paper!

Lenny: Thanks for resurrecting this thread!

Pavel: Thanks for the article!
It clarified a number of things about tuned collectives
(e.g. fixed vs. dynamic selection),
and the example rule file is very helpful too.

However, after reading the article and browsing through the code,
I still don't get what is that you call "segment size".
The article ranks "segment size" as an important parameter controlling
factor for good performance of collectives (section 5.1).
At first I thought "segment size" was the TCP/IP 'maximum segment size', 
the MTU (or its IB equivalent) minus headers, etc.

However, the article apparently says it is not this (section 5.1).

What is "segment size" in OpenMPI?
Can the "segment size" be directly or indirectly controlled by the user?

On the other hand, the example rule file always has
topo and segmentation = 0.
Why?
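
My other guess is that "segment size" is the chunk size used when a
collective pipelines a long message as a sequence of fixed-size pieces,
conceptually something like the sketch below, but I could not confirm
this from the article either:

--
/* Guess at the concept only: send "count" bytes as a pipeline of
   fixed-size segments instead of a single large message. */
#include "mpi.h"

static void segmented_send( const char *buf, int count, int seg_size,
                            int dest, int tag, MPI_Comm comm )
{
  int offset = 0;
  while ( offset < count ) {
    int chunk = ( count - offset < seg_size ) ? ( count - offset )
                                              : seg_size;
    MPI_Request req;
    MPI_Isend( (void *)( buf + offset ), chunk, MPI_CHAR, dest, tag,
               comm, &req );
    /* A real pipelined algorithm keeps several segments in flight;
       waiting immediately keeps the sketch short. */
    MPI_Wait( &req, MPI_STATUS_IGNORE );
    offset += chunk;
  }
}
--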

Thank you,
Gus Correa

Pavel Shamis (Pasha) wrote:

Lenny,
You can find some details here:
http://icl.cs.utk.edu/news_pub/submissions/Flex-collective-euro-pvmmpi-2006.pdf 



Pasha


Lenny Verkhovsky wrote:

Hi,
I am looking too for a file example of rules for dynamic collectives,
Have anybody tried it ? Where can I find a proper syntax for it ?
 
thanks.

Lenny.


 
On Thu, Jul 23, 2009 at 3:08 PM, Igor Kozin wrote:


Hi Gus,
I played with collectives a few months ago. Details are here

http://www.cse.scitech.ac.uk/disco/publications/WorkingNotes.ConnectX.pdf

That was in the context of 1.2.6

You can get available tuning options by doing
ompi_info -all -mca coll_tuned_use_dynamic_rules 1 | grep alltoall
and similarly for other collectives.

Best,
Igor

2009/7/23 Gus Correa:
> Dear OpenMPI experts
>
> I would like to experiment with the OpenMPI tuned collectives,
> hoping to improve the performance of some programs we run
> in production mode.
>
> However, I could not find any documentation on how to select the
> different collective algorithms and other parameters.
> In particular, I would love to read an explanation clarifying
> the syntax and meaning of the lines on "dyn_rules"
> file that is passed to
> "-mca coll_tuned_dynamic_rules_filename ./dyn_rules"
>
> Recently there was an interesting discussion on the list
> about this topic.  It showed that choosing the right collective
> algorithm can make a big difference in overall performance:
>
> http://www.open-mpi.org/community/lists/users/2009/05/9355.php
> http://www.open-mpi.org/community/lists/users/2009/05/9399.php
> http://www.open-mpi.org/community/lists/users/2009/05/9401.php
> http://www.open-mpi.org/community/lists/users/2009/05/9419.php
>
> However, the thread was concentrated on "MPI_Alltoall".
> Nothing was said about other collective functions.
> Not much was said about the
> "tuned collective dynamic rules" file syntax,
> the meaning of its parameters, etc.
>
> Is there any source of information about that which I missed?
> Thank you for any pointers or clarifications.
>
> Gus Correa
>
-
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
>
-
> ___
> users mailing list
> us...@open-mpi.org 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Automated tuning tool

2009-08-11 Thread Gus Correa

Thank you, John Casu and Edgar Gabriel for the pointers
to the parameter space sweep script and the OTPO code.

For simplicity,
I was thinking of testing each tuned collective separately,
instead of the applications, to have an idea
of which algorithms and parameters are best for our small cluster,
on a range of message and communicator sizes.

We have several applications, different problem sizes,
different number of processes, etc,
and all use a bunch of different collectives, besides point-to-point.

Gus Correa

john casu wrote:
> I'm not sure that there is a general "best set" of parameters, given the
> dependence of that set on comms patterns, etc...
>
> Still, this *is* a classic parameter sweep and optimization problem
> (unlike ATLAS), with a small number of parameters, and is the sort of
> thing one should be able to hook up fairly easily in a python script
> connected to a batch scheduler. Especially since you'd be likely to
> submit and run either a single job, or a number of equal sized jobs in
> parallel.
>
> In fact, here is a python script that works with SGE
> http://www.cs.umass.edu/~swarm/index.php?n=Sge.Py
>
> Now, you'd just have to choose the app, or apps that are important to you
>
>

Edgar Gabriel wrote:

Gus Correa wrote:

Terry Frankcombe wrote:

There's been quite some discussion here lately about the effect of OMPI
tuning parameters, particularly w.r.t. collectives.

Is there some tool to probe performance on any given network/collection
of nodes to aid optimisation of these parameters?

(I'm thinking something along the philosophy of ATLAS.)


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Hi Terry

We are also looking for this holy grail.

So far I found this 2008 reference to a certain
"Open Tool for Parameter Optimization (OTPO)":

http://www.springerlink.com/content/h5162153l184r7p0/

OTPO defines itself as this:

"OTPO systematically tests large numbers of combinations of Open MPI’s 
run-time tunable parameters for common communication patterns and 
performance metrics to determine the “best” set for a given platform."


you can checkout the OTPO code at

http://svn.open-mpi.org/svn/otpo/trunk/

It supports as of now netpipe and skampi collectives for tuning. It is 
far from perfect, but it is a starting point. If there are any issues,

please let us know...

Thanks
Edgar



However, I couldn't find any reference to the actual code or scripts,
and whether it is available, tested, free, downloadable, etc.

At this point I am doing these performance
tests in a laborious and inefficient manual way,
when I have the time to do it.

As some of the aforementioned article authors
are list subscribers (and OpenMPI developers),
maybe they can shed some light about OTPO, tuned collective 
optimization, OpenMPI runtime parameter optimization, etc.


IMHO, this topic deserves at least a FAQ.

Developers, Jeff:  Any suggestions?  :)

Many thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] need help with a code segment

2009-08-11 Thread Jeff Squyres

Are you including <mpi.h>?

I notice you have a -D for OMPI_MPI_ -- perhaps <mpi.h> is only
included if you -DLAM_MPI_...?  (that's a total guess)
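
In other words, something like the following, with made-up macro names
-- the first guard would never see mpi.h in an Open MPI build, while a
guard on the application's own flag works with either MPI:

--
/* Hypothetical illustration of the guess above: */
#ifdef LAM_MPI_            /* wrong guard: only true for a LAM build */
#include <mpi.h>
#endif

/* Guarding on the application's own MPI flag works with any MPI: */
#ifdef DRT_USE_MPI
#include <mpi.h>
MPI_Comm drt_pll_mpi_split_comm_world(int key);
#else
int drt_pll_mpi_split_comm_world(int key);
#endif
--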



On Aug 11, 2009, at 4:18 PM, Borenstein, Bernard S wrote:

I'm trying to build a code with OPENMPI 1.3.3 that compiles with LAM/ 
MPI.


It is using mpicc and here is the code segment and error :

void drt_pll_init(int my_rank,int num_processors);
#ifdef DRT_USE_MPI
#include <mpi.h>
MPI_Comm drt_pll_mpi_split_comm_world(int key);
#else
int drt_pll_mpi_split_comm_world(int key);
#endif

/fltapps/boeing/cfd/mpi/openmpi1.3.3_intel91_64/bin/mpicc -I/fltapps/ 
boeing/cf
mpi/openmpi1.3.3_intel91_64/include -DDRT_PARALLEL -DDRT_USE_MPI - 
DPRECISION=2
-O -I../../P3Dlib/src -I/include  -I/fltusr/borensbs/local/include - 
DOMPI_MPI_

   -c -o drt_dv_app.o drt_dv_app.c
drt_Lib.h(336): error: identifier "MPI_Comm" is undefined
  MPI_Comm drt_pll_mpi_split_comm_world(int key);
  ^

compilation aborted for drt_dv_app.c (code 2)
make[1]: *** [drt_dv_app.o] Error 2

Hope someone can help

Bernie Borenstein
The Boeing Company



__ Information from ESET NOD32 Antivirus, version of virus  
signature database 4326 (20090811) __


The message was checked by ESET NOD32 Antivirus.

http://www.eset.com


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] need help with a code segment

2009-08-11 Thread Borenstein, Bernard S
I'm trying to build a code with OPENMPI 1.3.3 that compiles with
LAM/MPI.

It is using mpicc and here is the code segment and error :

void drt_pll_init(int my_rank,int num_processors);
#ifdef DRT_USE_MPI
#include <mpi.h>
MPI_Comm drt_pll_mpi_split_comm_world(int key);
#else
int drt_pll_mpi_split_comm_world(int key);
#endif

/fltapps/boeing/cfd/mpi/openmpi1.3.3_intel91_64/bin/mpicc
-I/fltapps/boeing/cf
mpi/openmpi1.3.3_intel91_64/include -DDRT_PARALLEL -DDRT_USE_MPI
-DPRECISION=2
-O -I../../P3Dlib/src -I/include  -I/fltusr/borensbs/local/include
-DOMPI_MPI_
   -c -o drt_dv_app.o drt_dv_app.c
drt_Lib.h(336): error: identifier "MPI_Comm" is undefined
  MPI_Comm drt_pll_mpi_split_comm_world(int key);
  ^

compilation aborted for drt_dv_app.c (code 2)
make[1]: *** [drt_dv_app.o] Error 2

Hope someone can help

Bernie Borenstein
The Boeing Company


 

__ Information from ESET NOD32 Antivirus, version of virus
signature database 4326 (20090811) __

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com
 


[OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Jody Klymak
Hello,

On Aug 11, 2009, at  8:15 AM, Ralph Castain wrote:

You can turn off those mca params I gave you as you are now past that point. I know there are others that can help debug that TCP btl error, but they can help you there.

Just to eliminate the mitgcm from the debugging I compiled example/hello_c.c and run as:

 /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01 hello_c >& hello_c4_1host.txt

There is no ostensible problem.  If I run as:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01,xserve02 hello_c >& hello_c4_2host.txt

The process says Hello, but hangs at the end, and needs to be killed with ^C.

I then modified connectivity_c to include a printf as MPI is initialized, and hardwired verbose=1.  This completes, and appears to work fine.

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01 connectivity_c >& connectivity_c8_1host.txt

However, again, two hosts sours the mix:

/usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01,xserve02 connectivity_c >& connectivity_c8_2host.txt

This hangs, and after waiting a minute or so we see that rank 0--4 on xserve01 cannot contact rank 5 (presumably on xserve02).

It seems that I have something wrong in my tcp setup, but communication between these servers worked yesterday using 1.1.5, and ping etc all work fine, so something else is up.  Some sort of port permissions?

The most glaring error I see in these is:

[xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay - recipient list is empty!

I see reference in the archives to a similar error where "contacts.txt" could not be found.  I've had trouble with 10.5.7 with temporary directories, so maybe that is the issue?

Thanks Jody

[saturna.cluster:19174] progressed_wait: base/plm_base_launch_support.c 459
Daemon was launched on xserve01.cluster - beginning to initialize
Daemon [[28401,0],1] checking in as pid 43824 on host xserve01.cluster
Daemon [[28401,0],1] not using static ports
[saturna.cluster:19174] defining message event: base/plm_base_launch_support.c 
423
[xserve01.cluster:43824] [[28401,0],1] orted: up and running - waiting for 
commands!
[saturna.cluster:19174] defining message event: grpcomm_bad_module.c 183
[saturna.cluster:19174] progressed_wait: base/plm_base_launch_support.c 712
[saturna.cluster:19174] [[28401,0],0] orte:daemon:cmd:processor called by 
[[28401,0],0] for tag 1
[saturna.cluster:19174] [[28401,0],0] node[0].name saturna daemon 0 arch 
ffc90200
[saturna.cluster:19174] [[28401,0],0] node[1].name xserve01 daemon 1 arch 
ffc90200
[saturna.cluster:19174] [[28401,0],0] orted_cmd: received add_local_procs
[saturna.cluster:19174] defining message event: base/odls_base_default_fns.c 
1219
[saturna.cluster:19174] [[28401,0],0] orte:daemon:send_relay
[saturna.cluster:19174] [[28401,0],0] orte:daemon:send_relay sending relay msg 
to 1
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: received message from 
[[28401,0],0]
[xserve01.cluster:43824] defining message event: orted/orted_comm.c 159
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: reissued recv
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:cmd:processor called by 
[[28401,0],0] for tag 1
[xserve01.cluster:43824] [[28401,0],1] node[0].name saturna daemon 0 arch 
ffc90200
[xserve01.cluster:43824] [[28401,0],1] node[1].name xserve01 daemon 1 arch 
ffc90200
[xserve01.cluster:43824] [[28401,0],1] orted_cmd: received add_local_procs
[saturna.cluster:19174] defining message event: base/plm_base_launch_support.c 
668
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:send_relay
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:send_relay - recipient list 
is empty!
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: received message from 
[[28401,1],0]
[xserve01.cluster:43824] defining message event: orted/orted_comm.c 159
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: reissued recv
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:cmd:processor called by 
[[28401,1],0] for tag 1
[xserve01.cluster:43824] [[28401,0],1] orted_recv: received sync+nidmap from 
local proc [[28401,1],0]
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:cmd:processor: processing 
commands completed
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: received message from 
[[28401,1],1]
[xserve01.cluster:43824] defining message event: orted/orted_comm.c 159
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: reissued recv
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:cmd:processor called by 
[[28401,1],1] for tag 1
[xserve01.cluster:43824] [[28401,0],1] orted_recv: received sync+nidmap from 
local proc [[28401,1],1]
[xserve01.cluster:43824] [[28401,0],1] orte:daemon:cmd:processor: processing 
commands completed
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: received message from 
[[28401,1],2]
[xserve01.cluster:43824] defining message event: orted/orted_comm.c 159
[xserve01.cluster:43824] [[28401,0],1] orted_recv_cmd: reissued recv
[xserve01.cluster:43824] [[28401,0],1] 

Re: [OMPI users] problem configuring with torque

2009-08-11 Thread Gus Correa

Hi Craig, list

On my Rocks 4.3 cluster Torque is installed on /opt/torque,
not on /share/apps/torque.

That directory path may have changed on more recent versions of Rocks,
or you may have installed another copy of
of Torque on /share/apps/torque.
However, have you checked where Torque is installed?

My $0.02

Gus Correa

Ralph Castain wrote:


On Aug 10, 2009, at 10:36 PM, Craig Plaisance wrote:

I am building openmpi on a cluster running rocks.  When I build using 
./configure --with-tm=/share/apps/torque 
--prefix=/share/apps/openmpi/intel I receive the warning
configure: WARNING: Unrecognized options: --with-tm, 
--enable-ltdl-convenience


You can ignore these - there are some secondary operations going on that 
don't understand the options used in the general build.


After running make and make install, I run ompi-info | grep tm and 
don't see the entries


   MCA pls: tm (MCA v1.0, API v1.0, Component v1.0)
   MCA ras: tm (MCA v1.0, API v1.0, Component v1.0)


I assume you are building a 1.3.x version? If so, the pls framework no 
longer exists, which is why you don't see it. You should see a plm tm 
module, though.


If you aren't seeing a ras tm module, then it is likely that the system 
didn't find the required Torque support. Are you sure the given path is 
correct?


Note that the ras interface version has been bumped up, so it wouldn't 
show MCA v1.0 etc. - the numbers should be different now.





Any idea what is happening?  Thanks!
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Well, it now is launching just fine, so that's one thing! :-)

Afraid I'll have to let the TCP btl guys take over from here. It looks like
everything is up and running, but something strange is going on in the MPI
comm layer.

You can turn off those mca params I gave you as you are now past that point.
I know there are others that can help debug that TCP btl error, but they can
help you there.

Ralph


On Tue, Aug 11, 2009 at 8:54 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:
>
>  This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI.  You might want to use "find" or "locate" to
>> search your nodes and find it.  I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version installation.
>>
>
>
> OK, right you were - the old file was in my new install directory.  I
> didn't erase /usr/local/openmpi before re-running the install...
>
> However, after reinstalling on the nodes (but not cleaning out /usr/lib on
> all the nodes) I still have the following:
>
> Thanks,  Jody
>
>
> saturna.cluster:17660] mca:base:select:(  plm) Querying component [rsh]
> [saturna.cluster:17660] mca:base:select:(  plm) Query of component [rsh]
> set priority to 10
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [slurm]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [slurm].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [tm]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [tm].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [xgrid]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [xgrid].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Selected component [rsh]
> [saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660 nodename
> hash 1656374957
> [saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
> [saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
> [saturna.cluster:17660] mca:base:select:( odls) Querying component
> [default]
> [saturna.cluster:17660] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [saturna.cluster:17660] mca:base:select:( odls) Selected component
> [default]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote shell
> as local shell
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv:
>/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export
> PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid  -mca
> orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710
> ;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose
> 5
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve01
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon
> [[24811,0],1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh)
> [/usr/bin/ssh xserve01  PATH=/usr/local/openmpi/bin:$PATH ; export PATH ;
> LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
> 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://
> 192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
> Daemon was launched on xserve01.cluster - beginning to initialize
> [xserve01.cluster:42519] mca:base:select:( odls) Querying component
> [default]
> [xserve01.cluster:42519] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [xserve01.cluster:42519] mca:base:select:( odls) Selected component
> [default]
> Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
> Daemon [[24811,0],1] not using static ports
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve02
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon
> [[24811,0],2]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh)
> [/usr/bin/ssh xserve02  PATH=/usr/local/openmpi/bin:$PATH ; export PATH ;
> LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 2 -mca orte_ess_num_procs
> 3 

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Yeah, it's the lib confusion that's the problem - this is the problem:

[saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described
> vs non-described) mismatch - operation not allowed in file
> base/odls_base_default_fns.c at line 2475
>

Have you tried configuring with --enable-mpirun-prefix-by-default? This
would help avoid the confusion. You also should check your path to ensure
that it is correct as well (make sure that mpirun is the one you expect, and
that you are getting the corresponding remote orted).

Ralph

On Tue, Aug 11, 2009 at 8:23 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:
>
> Sigh - too early in the morning for this old brain, I fear...
>
> You are right - the ranks are fine, and local rank doesn't matter. It
> sounds like a problem where the TCP messaging is getting a message ack'd
> from someone other than the process that was supposed to recv the message.
> This should cause us to abort, but we were just talking on the phone that
> the abort procedure may not be working correctly. Or it could be (as Jeff
> suggests) that the version mismatch is also preventing us from properly
> aborting too.
>
> So I fear we are back to trying to find these other versions on your
> nodes...
>
>
> Well, the old version is still on the nodes (in /usr/lib as default for OS
> X)...
>
> I can try and clean those all out by hand but I'm still confused why the
> old version would be used - how does openMPI find the right library?
>
> Note again, that I get these MCA warnings on the server when just running
> ompi_info and I *have* cleaned out /usr/lib on the server.  So I really
> don't understand how on the server I can still have a library issue.  Is
> there a way to trace at runtime what library an executable is dynamically
> linking to?  Can I rebuild openmpi statically?
>
> Thanks,  Jody
>
>
>
>
> On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:
>
>>
>> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>>
>>  The reason your job is hanging is sitting in the orte-ps output. You have
>>> multiple processes declaring themselves to be the same MPI rank. That
>>> definitely won't work.
>>>
>>
>> Its the "local rank" if that makes any difference...
>>
>> Any thoughts on this output?
>>
>> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
>> received unexpected process identifier [[61029,1],3]
>>
>>  The question is why is that happening? We use Torque all the time, so we
>>> know that the basic support is correct. It -could- be related to lib
>>> confusion, but I can't tell for sure.
>>>
>>
>> Just to be clear, this is not going through torque at this point.  Its
>> just vanilla ssh, for which this code worked with 1.1.5.
>>
>>
>>  Can you rebuild OMPI with --enable-debug, and rerun the job with the
>>> following added to your cmd line?
>>>
>>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>>>
>>
>> Working on this...
>>
>> Thanks,  Jody
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:

This means that OMPI is finding an mca_iof_proxy.la file at run time  
from a prior version of Open MPI.  You might want to use "find" or  
"locate" to search your nodes and find it.  I suspect that you  
somehow have an OMPI 1.3.x install that overlaid an install of a  
prior OMPI version installation.



OK, right you were - the old file was in my new install directory.  I  
didn't erase /usr/local/openmpi before re-running the install...


However, after reinstalling on the nodes (but not cleaning out /usr/ 
lib on all the nodes) I still have the following:


Thanks,  Jody


saturna.cluster:17660] mca:base:select:(  plm) Querying component [rsh]
[saturna.cluster:17660] mca:base:select:(  plm) Query of component  
[rsh] set priority to 10
[saturna.cluster:17660] mca:base:select:(  plm) Querying component  
[slurm]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[slurm]. Query failed to return a module

[saturna.cluster:17660] mca:base:select:(  plm) Querying component [tm]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[tm]. Query failed to return a module
[saturna.cluster:17660] mca:base:select:(  plm) Querying component  
[xgrid]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[xgrid]. Query failed to return a module

[saturna.cluster:17660] mca:base:select:(  plm) Selected component [rsh]
[saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660  
nodename hash 1656374957

[saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
[saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
[saturna.cluster:17660] mca:base:select:( odls) Querying component  
[default]
[saturna.cluster:17660] mca:base:select:( odls) Query of component  
[default] set priority to 1
[saturna.cluster:17660] mca:base:select:( odls) Selected component  
[default]

[saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
[saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job  
[24811,1]

[saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote  
shell as local shell

[saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv:
	/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export  
PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ;  
export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons  
-mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid  
 -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp:// 
142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose  
5 -mca odls_base_verbose 5
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node  
xserve01
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of  
daemon [[24811,0],1]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve01  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca  
orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri  
"1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" - 
mca plm_base_verbose 5 -mca odls_base_verbose 5]

Daemon was launched on xserve01.cluster - beginning to initialize
[xserve01.cluster:42519] mca:base:select:( odls) Querying component  
[default]
[xserve01.cluster:42519] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve01.cluster:42519] mca:base:select:( odls) Selected component  
[default]

Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
Daemon [[24811,0],1] not using static ports
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node  
xserve02
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of  
daemon [[24811,0],2]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve02  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca  
orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri  
"1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" - 
mca plm_base_verbose 5 -mca odls_base_verbose 5]

Daemon was launched on xserve02.local - beginning to initialize
[xserve02.local:42180] mca:base:select:( odls) Querying component  
[default]
[xserve02.local:42180] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve02.local:42180] mca:base:select:( odls) Selected component  
[default]

Daemon [[24811,0],2] checking in as pid 42180 on host xserve02.local
Daemon [[24811,0],2] not using 

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:


Sigh - too early in the morning for this old brain, I fear...

You are right - the ranks are fine, and local rank doesn't matter.  
It sounds like a problem where the TCP messaging is getting a  
message ack'd from someone other than the process that was supposed  
to recv the message. This should cause us to abort, but we were just  
talking on the phone that the abort procedure may not be working  
correctly. Or it could be (as Jeff suggests) that the version  
mismatch is also preventing us from properly aborting too.


So I fear we are back to trying to find these other versions on your  
nodes...


Well, the old version is still on the nodes (in /usr/lib as default  
for OS X)...


I can try and clean those all out by hand but I'm still confused why  
the old version would be used - how does openMPI find the right library?


Note again, that I get these MCA warnings on the server when just  
running ompi_info and I *have* cleaned out /usr/lib on the server.  So  
I really don't understand how on the server I can still have a library  
issue.  Is there a way to trace at runtime what library an executable  
is dynamically linking to?  Can I rebuild openmpi statically?


Thanks,  Jody





On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:

On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

The reason your job is hanging is sitting in the orte-ps output. You  
have multiple processes declaring themselves to be the same MPI  
rank. That definitely won't work.


Its the "local rank" if that makes any difference...

Any thoughts on this output?


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected  
process identifier [[61029,1],3]


The question is why is that happening? We use Torque all the time,  
so we know that the basic support is correct. It -could- be related  
to lib confusion, but I can't tell for sure.


Just to be clear, this is not going through torque at this point.   
Its just vanilla ssh, for which this code worked with 1.1.5.




Can you rebuild OMPI with --enable-debug, and rerun the job with the  
following added to your cmd line?


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

Working on this...

Thanks,  Jody

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate  
seeing it. Might also tell us something about the lib issue.



Command line was:

/usr/local/openmpi/bin/mpirun -mca plm_base_verbose 5 --debug-daemons - 
mca odls_base_verbose 5 -n 16 --host xserve03,xserve04 ../build/mitgcmuv



Starting: ../results//TasGaussRestart16
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [rsh]
[saturna.cluster:07360] mca:base:select:(  plm) Query of component  
[rsh] set priority to 10
[saturna.cluster:07360] mca:base:select:(  plm) Querying component  
[slurm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[slurm]. Query failed to return a module

[saturna.cluster:07360] mca:base:select:(  plm) Querying component [tm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[tm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:(  plm) Querying component  
[xgrid]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[xgrid]. Query failed to return a module

[saturna.cluster:07360] mca:base:select:(  plm) Selected component [rsh]
[saturna.cluster:07360] plm:base:set_hnp_name: initial bias 7360  
nodename hash 1656374957

[saturna.cluster:07360] plm:base:set_hnp_name: final jobfam 14551
[saturna.cluster:07360] [[14551,0],0] plm:base:receive start comm
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_xgrid"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca:base:select:( odls) Querying component  
[default]
[saturna.cluster:07360] mca:base:select:( odls) Query of component  
[default] set priority to 1
[saturna.cluster:07360] mca:base:select:( odls) Selected component  
[default]
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored

[saturna.cluster:07360] [[14551,0],0] plm:rsh: setting up job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:setup_job for job  
[14551,1]

[saturna.cluster:07360] [[14551,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: assuming same remote  
shell as local shell

[saturna.cluster:07360] [[14551,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: final template argv:
	/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export  
PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ;  
export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons  
-mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid  
 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp:// 
142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose  
5 -mca odls_base_verbose 5
[saturna.cluster:07360] [[14551,0],0] plm:rsh: launching on node  
xserve03
[saturna.cluster:07360] [[14551,0],0] plm:rsh: recording launch of  
daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve03  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca  
orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp:// 
142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose  
5 -mca odls_base_verbose 5]

Daemon was launched on xserve03.local - beginning to initialize
[xserve03.local:40708] mca:base:select:( odls) Querying component  
[default]
[xserve03.local:40708] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve03.local:40708] mca:base:select:( odls) Selected component  
[default]
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sigh - too early in the morning for this old brain, I fear...

You are right - the ranks are fine, and local rank doesn't matter. It sounds
like a problem where the TCP messaging is getting a message ack'd from
someone other than the process that was supposed to recv the message. This
should cause us to abort, but we were just talking on the phone that the
abort procedure may not be working correctly. Or it could be (as Jeff
suggests) that the version mismatch is also preventing us from properly
aborting too.

So I fear we are back to trying to find these other versions on your
nodes...


On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
>  The reason your job is hanging is sitting in the orte-ps output. You have
>> multiple processes declaring themselves to be the same MPI rank. That
>> definitely won't work.
>>
>
> Its the "local rank" if that makes any difference...
>
> Any thoughts on this output?
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
> received unexpected process identifier [[61029,1],3]
>
>  The question is why is that happening? We use Torque all the time, so we
>> know that the basic support is correct. It -could- be related to lib
>> confusion, but I can't tell for sure.
>>
>
> Just to be clear, this is not going through torque at this point.  Its just
> vanilla ssh, for which this code worked with 1.1.5.
>
>
>  Can you rebuild OMPI with --enable-debug, and rerun the job with the
>> following added to your cmd line?
>>
>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>>
>
> Working on this...
>
> Thanks,  Jody
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres

On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote:


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process
identifier [[61029,1],3]



This could well be caused by a version mismatch between your nodes.  
E.g., if one node is running OMPI vx.y.z and another is running  
va.b.c.  We don't check for version mismatch in network  
communications, and our wire protocols have changed between versions.  
So if vx.y.z sends something that is not understood by va.b.c,  
something like the above message could occur.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

The reason your job is hanging is sitting in the orte-ps output. You  
have multiple processes declaring themselves to be the same MPI  
rank. That definitely won't work.


Its the "local rank" if that makes any difference...

Any thoughts on this output?

[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process  
identifier [[61029,1],3]


The question is why is that happening? We use Torque all the time,  
so we know that the basic support is correct. It -could- be related  
to lib confusion, but I can't tell for sure.


Just to be clear, this is not going through torque at this point.  Its  
just vanilla ssh, for which this code worked with 1.1.5.



Can you rebuild OMPI with --enable-debug, and rerun the job with the  
following added to your cmd line?


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5


Working on this...

Thanks,  Jody


[OMPI users] Error: system limit exceeded on number of pipes that can be open

2009-08-11 Thread Mike Dubman
Hello guys,


When executing the following command with mtt and ompi 1.3.3:

mpirun --host 
witch15,witch15,witch15,witch15,witch16,witch16,witch16,witch16,witch17,witch17,witch17,witch17,witch18,witch18,witch18,witch18,witch19,witch19,witch19,witch19
-np 20   --mca btl_openib_use_srq 1  --mca btl self,sm,openib
~mtt/mtt-scratch/20090809140816_dellix8_11812/installs/mnum/tests/ibm/ibm/dynamic/loop_spawn


getting following errors:

parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
parent: MPI_Comm_spawn #40 return : 0
parent: MPI_Comm_spawn #60 return : 0
parent: MPI_Comm_spawn #80 return : 0
parent: MPI_Comm_spawn #100 return : 0
parent: MPI_Comm_spawn #120 return : 0
parent: MPI_Comm_spawn #140 return : 0
parent: MPI_Comm_spawn #160 return : 0
parent: MPI_Comm_spawn #180 return : 0
parent: MPI_Comm_spawn #200 return : 0
parent: MPI_Comm_spawn #220 return : 0
parent: MPI_Comm_spawn #240 return : 0
parent: MPI_Comm_spawn #260 return : 0
parent: MPI_Comm_spawn #280 return : 0
parent: MPI_Comm_spawn #300 return : 0
parent: MPI_Comm_spawn #320 return : 0
parent: MPI_Comm_spawn #340 return : 0
parent: MPI_Comm_spawn #360 return : 0
parent: MPI_Comm_spawn #380 return : 0
parent: MPI_Comm_spawn #400 return : 0
parent: MPI_Comm_spawn #420 return : 0
parent: MPI_Comm_spawn #440 return : 0
parent: MPI_Comm_spawn #460 return : 0
parent: MPI_Comm_spawn #480 return : 0
parent: MPI_Comm_spawn #500 return : 0
parent: MPI_Comm_spawn #520 return : 0
parent: MPI_Comm_spawn #540 return : 0
parent: MPI_Comm_spawn #560 return : 0
parent: MPI_Comm_spawn #580 return : 0
--
mpirun was unable to launch the specified application as it
encountered an error:

Error: system limit exceeded on number of pipes that can be open
Node: witch19

when attempting to start process rank 0.

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
asking the system administrator for that node to increase the system limit, or
by rearranging your processes to place fewer of them on that node.




Do you know what OS param I should change in order to resolve it?
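
Judging from the help text, I assume this is the per-process open
descriptor limit (each pipe consumes descriptors on the node that
launches the processes). Here is the sketch I plan to use to check what
the nodes currently allow, in case that helps:

--
/* Sketch: print the per-process descriptor limit that the "limit or
   ulimit" advice above refers to (pipes consume these descriptors). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
    printf("open descriptors: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);
  return 0;
}
--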

Thanks

Mike


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Oops - I should have looked at your output more closely. The component_find
warnings are clearly indicating some old libs laying around, but that isn't
why your job is hanging.

The reason your job is hanging is sitting in the orte-ps output. You have
multiple processes declaring themselves to be the same MPI rank. That
definitely won't work.

The question is why is that happening? We use Torque all the time, so we
know that the basic support is correct. It -could- be related to lib
confusion, but I can't tell for sure.

Can you rebuild OMPI with --enable-debug, and rerun the job with the
following added to your cmd line?

-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate seeing
it. Might also tell us something about the lib issue.

Thanks
Ralph


On Tue, Aug 11, 2009 at 7:22 AM, Ralph Castain  wrote:

> Sorry, but Jeff is correct - that error message clearly indicates a version
> mismatch. Somewhere, one or more of your nodes is still picking up an old
> version.
>
>
>
>
> On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres  wrote:
>
>> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>>
>>  I have removed all the OS-X -supplied libraries, recompiled and
>>> installed openmpi 1.3.3, and I am *still* getting this warning when
>>> running ompi_info:
>>>
>>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> supported MCA v2.0.0) -- ignored
>>>
>>>
>> This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI.  You might want to use "find" or "locate" to
>> search your nodes and find it.  I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version installation.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sorry, but Jeff is correct - that error message clearly indicates a version
mismatch. Somewhere, one or more of your nodes is still picking up an old
version.



On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres  wrote:

> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>
>  I have removed all the OS-X -supplied libraries, recompiled and
>> installed openmpi 1.3.3, and I am *still* getting this warning when
>> running ompi_info:
>>
>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>> supported MCA v2.0.0) -- ignored
>>
>>
> This means that OMPI is finding an mca_iof_proxy.la file at run time from
> a prior version of Open MPI.  You might want to use "find" or "locate" to
> search your nodes and find it.  I suspect that you somehow have an OMPI
> 1.3.x install that overlaid an install of a prior OMPI version installation.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres

On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:


I have removed all the OS-X -supplied libraries, recompiled and
installed openmpi 1.3.3, and I am *still* getting this warning when
running ompi_info:

[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
uses an MCA interface that is not recognized (component MCA v1.0.0 !=
supported MCA v2.0.0) -- ignored



This means that OMPI is finding an mca_iof_proxy.la file at run time  
from a prior version of Open MPI.  You might want to use "find" or  
"locate" to search your nodes and find it.  I suspect that you somehow  
have an OMPI 1.3.x install that overlaid an install of a prior OMPI  
version installation.
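
A rough way to hunt for those leftovers (a sketch; adjust the search roots to wherever older builds may have landed):

  # anything matching outside the 1.3.3 prefix (/usr/local/openmpi) is suspect
  find /usr /opt /Library -name 'mca_iof_proxy*' 2>/dev/null
  # or, if a locate database is available:
  locate mca_iof_proxy.la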


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


I am still finding this very mysterious

I have removed all the OS-X -supplied libraries, recompiled and  
installed openmpi 1.3.3, and I am *still* getting this warning when  
running ompi_info:


[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_xgrid"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: rcache  
"mca_rcache_rb" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored


So, I guess I'm not clear how the library can be an issue...

I *do* get another error from running the gcm that I do not get from  
running simpler jobs - hopefully this helps explain things:


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]


The mitgcmuv processes are running on the xserves, and they are using
considerable resources!  They open STDERR/STDOUT
but nothing is flushed into them, including the few print statements  
I've put in before and after MPI_INIT (as Ralph suggested).


On 11-Aug-09, at 4:17 AM, Ashley Pittman wrote:

If you suspect a hang then you can use the command orte-ps (on the node
where the mpirun is running) and it should show you your job.  This will
tell you if the job is started and still running or if there was a
problem launching.


/usr/local/openmpi/bin/orte-ps
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored



Information from mpirun [61029,0]
---------------------------------

    JobID   |  State  | Slots | Num Procs |
  ----------+---------+-------+-----------+
  [61029,1] | Running |   2   |    16     |

       Process Name |      ORTE Name | Local Rank |   PID |     Node |   State |
  ------------------+----------------+------------+-------+----------+---------+
  ../build/mitgcmuv |  [[61029,1],0] |          0 | 40206 | xserve03 | Running |
  ../build/mitgcmuv |  [[61029,1],1] |          0 | 40005 | xserve04 | Running |
  ../build/mitgcmuv |  [[61029,1],2] |          1 | 40207 | xserve03 | Running |
  ../build/mitgcmuv |  [[61029,1],3] |          1 | 40006 | xserve04 | Running |
  ../build/mitgcmuv |  [[61029,1],4] |          2 | 40208 | xserve03 | Running |
  ../build/mitgcmuv |  [[61029,1],5] |          2 | 40007 | xserve04 | Running |
  ../build/mitgcmuv |  [[61029,1],6] |          3 | 40209 | xserve03 | Running |
  ../build/mitgcmuv |  [[61029,1],7] |          3 | 40008 | xserve04 | Running |
  ../build/mitgcmuv |  [[61029,1],8] |          4 | 40210 | xserve03 | Running |
  ../build/mitgcmuv |  [[61029,1],9] |          4 | 40009 | xserve04 | Running |
  ../build/mitgcmuv | [[61029,1],10] |          5 | 40211 | xserve03 | Running |
  ../build/mitgcmuv | [[61029,1],11] |          5 | 40010 | xserve04 | Running |
  ../build/mitgcmuv | [[61029,1],12] |          6 | 40212 | xserve03 | Running |
  ../build/mitgcmuv | [[61029,1],13] |          6 | 40011 | xserve04 | Running |
  ../build/mitgcmuv | [[61029,1],14] |          7 | 40213 | xserve03 | Running |
  ../build/mitgcmuv | [[61029,1],15] |          7 | 40012 | xserve04 | Running |


Thanks,  Jody





Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain


On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote:


On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:

If it isn't already there, try putting a print statement right at
program start, another just prior to MPI_Init, and another just after
MPI_Init. It could be that something is hanging somewhere during
program startup since it sounds like everything is launching just
fine.


If you suspect a hang then you can use the command orte-ps (on the node
where the mpirun is running) and it should show you your job.  This will
tell you if the job is started and still running or if there was a
problem launching.

If the program did start and has really hung then you can get more
in-depth information about it using padb which is linked to in my
signature.


FWIW: we use padb for this purpose, and it is very helpful!

Ralph



Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk





Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ashley Pittman
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
> If it isn't already there, try putting a print statement right at
> program start, another just prior to MPI_Init, and another just after
> MPI_Init. It could be that something is hanging somewhere during
> program startup since it sounds like everything is launching just
> fine.

If you suspect a hang then you can use the command orte-ps (on the node
where the mpirun is running) and it should show you your job.  This will
tell you if the job is started and still running or if there was a
problem launching.
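
For example, something along these lines on the launch node (the output should resemble the orte-ps listing shown earlier in this thread):

  # run on the same node as mpirun itself
  orte-ps
  # a healthy launch shows the job in state Running, one line per rank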

If the program did start and has really hung then you can get more
in-depth information about it using padb which is linked to in my
signature.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



[OMPI users] How to make a job abort when one host dies?

2009-08-11 Thread Oskar Enoksson
I searched the FAQ and google but couldn't come up with a solution to 
this problem.


My problem is that when one MPI execution host dies or the network 
connection goes down the job is not aborted. Instead the remaining 
processes continue to eat 100% CPU indefinitely. How can I make jobs 
abort in these cases?


I use OpenMPI 1.3.2. We have a myrinet network and I use mtl/mx for mpi 
communication. We also use gridengine 6.2u3. The output from the running 
job indicates that the remaining processes detect a timeout trying to 
communicate with the (dead) host cl120.foi.se. But why do they not 
terminate after this failure?


Thanks.

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=1, endpoint=1, seqnum=0x2b8f
   matched_val: 0x0004000d_fff4
   slength=48, xfer_length=48
   seg: 0x7fffe11ff830,48
   caller: 0xdb

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=116, endpoint=1, seqnum=0x3726
   matched_val: 0x00040001_fff4
   slength=48, xfer_length=48
   seg: 0x7124b7b0,48
   caller: 0x9b

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=1, endpoint=0, seqnum=0x1048
   matched_val: 0x00040006_fff4
   slength=48, xfer_length=48
   seg: 0x7fffc6470eb0,48
   caller: 0x70

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=1, endpoint=1, seqnum=0xd53
   matched_val: 0x00040007_fff4
   slength=48, xfer_length=48
   seg: 0x1f54360,48
   caller: 0xda

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=116, endpoint=0, seqnum=0x376c
   matched_val: 0x0004_fff4
   slength=48, xfer_length=48
   seg: 0x82ec040,48
   caller: 0x12

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=1, endpoint=0, seqnum=0x2746
   matched_val: 0x0004000c_fff4
   slength=48, xfer_length=48
   seg: 0x1116f410,48
   caller: 0x30

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (1): send_small
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=1, endpoint=1, seqnum=0x18de
   matched_val: 0x00250001_fff4
   slength=104, xfer_length=104
   seg: 0x181c3100,104
   caller: 0x18

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/1
Aborted 2 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries reached (1000) for message
   type (2): send_medium
   state (0x14): buffered dead
   requeued: 1000 (timeout=501000ms)
   dest: 00:60:dd:49:78:59 (cl120.foi.se:0)
   partner: peer_index=116, endpoint=0, seqnum=0x3361
   matched_val: 0x0004000f_0010
   slength=7168, xfer_length=7168
   seg: 0x23e8a838,7168
   caller: 0x7e

Was trying to contact
   00:60:dd:49:78:59 (cl120.foi.se:0)/0
Aborted 1 send requests due to remote peer 00:60:dd:49:78:59 
(cl120.foi.se:0) disconnected

Max retransmit retries 

Re: [OMPI users] problem configuring with torque

2009-08-11 Thread Ralph Castain


On Aug 10, 2009, at 10:36 PM, Craig Plaisance wrote:

I am building openmpi on a cluster running Rocks.  When I build using
  ./configure --with-tm=/share/apps/torque --prefix=/share/apps/openmpi/intel
I receive the warning
  configure: WARNING: Unrecognized options: --with-tm, --enable-ltdl-convenience


You can ignore these - there are some secondary operations going on  
that don't understand the options used in the general build.


After running make and make install, I run ompi_info | grep tm and
don't see the entries


   MCA pls: tm (MCA v1.0, API v1.0, Component v1.0)
   MCA ras: tm (MCA v1.0, API v1.0, Component v1.0)


I assume you are building a 1.3.x version? If so, the pls framework no  
longer exists, which is why you don't see it. You should see a plm tm  
module, though.


If you aren't seeing a ras tm module, then it is likely that the  
system didn't find the required Torque support. Are you sure the given  
path is correct?


Note that the ras interface version has been bumped up, so it wouldn't  
show MCA v1.0 etc. - the numbers should be different now.
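
As a rough check against a 1.3.x build (the version strings below are illustrative, not exact):

  ./configure --with-tm=/share/apps/torque --prefix=/share/apps/openmpi/intel
  make all install
  ompi_info | grep tm
  # expect output along the lines of:
  #   MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.3)
  #   MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.3)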





Any idea what is happening?  Thanks!




Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


Note that I always configure with --prefix=somewhere-in-my-own-dir,  
never to a system directory. Avoids this kind of confusion.


Yeah, I did configure --prefix=/usr/local/openmpi

What the errors are saying is that we are picking up components from  
a very old version of OMPI that is distributed by Apple. It may or  
may not be causing confusion for the system - hard to tell. However,  
the fact that it is the IO forwarding subsystem that is picking them  
up, and the fact that you aren't seeing any output from your job,  
makes me a tad suspicious.


Me too!

Can you run other jobs? In other words, do you get stdout/stderr  
from other programs you run, or does every MPI program hang (even  
simple ones)? If it is just your program, then it could just be that  
your application is hanging before any output is generated. Can you  
have it print something to stderr right when it starts?


No; simple ones, like the examples I gave before, run fine, just with  
the suspicious warnings.


I'm running a big general circulation model (MITgcm).  Under normal  
conditions it spits something out almost right away, but that is not  
happening here.  STDOUT.0001 etc. are all opened, but nothing is put  
into them.


I'm pretty sure I'm compliling the gcm properly:

otool -L mitgcmuv
mitgcmuv:
	/usr/local/openmpi/lib/libmpi_f77.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/local/openmpi/lib/libmpi.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/local/openmpi/lib/libopen-rte.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/local/openmpi/lib/libopen-pal.0.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/lib/libutil.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/local/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
	/usr/local/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.3)


Thanks,  Jody




On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:



On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:

Check your LD_LIBRARY_PATH - there is an earlier version of OMPI  
in your path that is interfering with operation (i.e., it comes  
before your 1.3.3 installation).


Hmmm, the OS X FAQ says not to do this:

"Note that there is no need to add Open MPI's libdir to  
LD_LIBRARY_PATH; Open MPI's shared library build process  
automatically uses the "rpath" mechanism to automatically find the  
correct shared libraries (i.e., the ones associated with this  
build, vs., for example, the OS X-shipped OMPI shared libraries).  
Also note that we specifically do not recommend adding Open MPI's  
libdir to DYLD_LIBRARY_PATH."


http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof  
"mca_iof_svc" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored


echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again.  I suppose I could go clean out  
all the libraries in /usr/lib/...


Thanks again, sorry to be a pain...

Cheers,  Jody






On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:


So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the  
MITgcm does not, either under ssh or torque.  It hangs at some  
early point in execution before anything is written, so it's hard  
for me to tell what the error is.  Could these MCA warnings have  
anything to do with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so  
hopefully that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy"

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


rhc$ echo $PATH
/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/ 
openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/ 
bin:/opt/local/bin:/opt/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/ 
local/bin:/usr/texbin


rhc$ echo $LD_LIBRARY_PATH
/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/X11R6/lib:/ 
usr/local/lib:


Note that I always configure with --prefix=somewhere-in-my-own-dir,  
never to a system directory. Avoids this kind of confusion.
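
A minimal sketch of that kind of setup (paths are illustrative; they mirror the PATH/LD_LIBRARY_PATH shown above):

  ./configure --prefix=$HOME/openmpi
  make all install
  export PATH=$HOME/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH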


What the errors are saying is that we are picking up components from a  
very old version of OMPI that is distributed by Apple. It may or may  
not be causing confusion for the system - hard to tell. However, the  
fact that it is the IO forwarding subsystem that is picking them up,  
and the fact that you aren't seeing any output from your job, makes me  
a tad suspicious.


Can you run other jobs? In other words, do you get stdout/stderr from  
other programs you run, or does every MPI program hang (even simple  
ones)? If it is just your program, then it could just be that your  
application is hanging before any output is generated. Can you have it  
print something to stderr right when it starts?



On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:



On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:

Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in  
your path that is interfering with operation (i.e., it comes before  
your 1.3.3 installation).


Hmmm, the OS X FAQ says not to do this:

"Note that there is no need to add Open MPI's libdir to  
LD_LIBRARY_PATH; Open MPI's shared library build process  
automatically uses the "rpath" mechanism to automatically find the  
correct shared libraries (i.e., the ones associated with this build,  
vs., for example, the OS X-shipped OMPI shared libraries). Also note  
that we specifically do not recommend adding Open MPI's libdir to  
DYLD_LIBRARY_PATH."


http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored


echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again.  I suppose I could go clean out all  
the libraries in /usr/lib/...


Thanks again, sorry to be a pain...

Cheers,  Jody






On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:


So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the  
MITgcm does not, either under ssh or torque.  It hangs at some  
early point in execution before anything is written, so it's hard  
for me to tell what the error is.  Could these MCA warnings have  
anything to do with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so  
hopefully that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

==   ALLOCATED NODES   ==

Data for node: Name: xserve02.local    Num slots: 8    Max slots: 0
Data for node: Name: xserve01.local    Num slots: 8    Max slots: 0

=

   JOB MAP   

Data for node: Name: xserve02.local    Num procs: 1
  Process OMPI jobid: [20967,1]    Process rank: 0

Data for node: Name: xserve01.local    Num procs: 1
  Process OMPI jobid: [20967,1]    Process rank: 1

=