Hi Bernard,

Yes, with export LAMRSH='ssh -x', everything works. I will rebuild my LAM-ifort with this flag. Many thanks for your tip.
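
For reference, the rebuild I have in mind is roughly this (only a sketch; the prefix is taken from my current install and I have not verified the exact configure line yet):

         ./configure --prefix=/opt/lam-7.0-ifort --with-rsh="ssh -x"
         make
         make install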

Now, if I run "qsub mpihello.pbs", $PBS_NODEFILE lists only 2 nodes:
         node3.cluster.ird.nc
         node2.cluster.ird.nc

Even if I put the "N" flag after mpirun in my PBS script.

If I run lamboot and mpirun -np 8 mpihello by hand, all 4 of my nodes compute?!

I think PBS doesn't give me the correct node list.

With OSCAR, how does PBS define $PBS_NODEFILE? How can I correct it?
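
For what it is worth, here is the kind of PBS script I am trying (a sketch only; the nodes=4:ppn=2 request and the lamboot/lamhalt calls are my assumption of the usual pattern, not something OSCAR generated for me):

         #!/bin/sh
         #PBS -N mpihello
         #PBS -l nodes=4:ppn=2
         cd $PBS_O_WORKDIR
         # boot LAM only on the nodes PBS allocated to this job
         lamboot $PBS_NODEFILE
         mpirun -np 8 ./mpihello
         lamhalt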

Many thanks

Ciao.
Jérome

At 15:19 21/03/2005 -0800, Bernard Li wrote:
Hi Jerome:

You can either:

1) Turn on rsh on the cluster nodes (this is turned off by default on an
OSCAR cluster)
2) Use ssh instead of rsh: export LAMRSH='ssh -x' (I think that's the
correct environment variable). See the snippet below.
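
Something like this, assuming passwordless ssh from the head node to the
compute nodes is already set up (node1.cluster.ird.nc is just an example
taken from your cluster):

         export LAMRSH='ssh -x'
         # should print the node's hostname with no password prompt
         # and no extra output:
         ssh -x node1.cluster.ird.nc hostname
         lamboot -v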

Cheers,

Bernard

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Lefevre Jerome
> Sent: Monday, March 21, 2005 15:05
> To: [email protected]
> Subject: [Oscar-users] RE: lamboot not found
>
> Hi,
>
> Now my default MPI propagates across the cluster. It was not doing so before
> because my home directory was not mounted; I think it was not mounted because
> I booted the nodes before the cluster head node. "cexec mount -a" fixed it,
> and all is right now. "cexec switcher mpi" shows my new default MPI,
> "Lam-oscar-7.0-ifort".
>
> To test my LAM configuration, I edited Lam-7.0-ifort/etc/lam-bhost.def by
> hand on my front-end, listing my nodes and the front-end (see the sketch
> just below).
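>
> As far as I understand, the boot schema is just one host per line with a
> cpu count, so my file looks roughly like this (the cpu=2 values match my
> dual-CPU nodes):
>
>          node1.cluster.ird.nc cpu=2
>          node2.cluster.ird.nc cpu=2
>          node3.cluster.ird.nc cpu=2
>          editr.cluster.ird.nc cpu=2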
>
> However, if I type lamboot -v, LAM complains:
> n-1<6228> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
> ERROR: node1.cluster.ird.nc: connection refused
>
> The output below seems to point at rsh...
>
> I have a doubt: when I configured LAM 7.0.6 from source, I omitted to specify
> configure --with-rsh="ssh -x". Does that matter?
>
> See my output from "lamboot -d" below.
>
> Many thanks
>
> jerome
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<6372> ssi:boot:base: looking for boot schema in following
> directories:
> n-1<6372> ssi:boot:base:   <current directory>
> n-1<6372> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<6372> ssi:boot:base:   $LAMHOME/etc
> n-1<6372> ssi:boot:base:   /opt/lam-7.0-ifort/etc
> n-1<6372> ssi:boot:base: looking for boot schema file:
> n-1<6372> ssi:boot:base:   lam-bhost.def
> n-1<6372> ssi:boot:base: found boot schema: /opt/lam-7.0-ifort/etc/lam-bhost.def
> n-1<6372> ssi:boot:rsh: found the following hosts:
> n-1<6372> ssi:boot:rsh:   n0 node1.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh:   n1 node2.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh:   n2 node3.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh:   n3 editr.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh: resolved hosts:
> n-1<6372> ssi:boot:rsh:   n0 node1.cluster.ird.nc --> 192.168.150.1
> n-1<6372> ssi:boot:rsh:   n1 node2.cluster.ird.nc --> 192.168.150.2
> n-1<6372> ssi:boot:rsh:   n2 node3.cluster.ird.nc --> 192.168.150.3
> n-1<6372> ssi:boot:rsh:   n3 editr.cluster.ird.nc --> 192.168.150.50 (origin)
> n-1<6372> ssi:boot:rsh: starting RTE procs
> n-1<6372> ssi:boot:base:linear: starting
> n-1<6372> ssi:boot:base:server: opening server TCP socket
> n-1<6372> ssi:boot:base:server: opened port 37073
> n-1<6372> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:rsh: starting lamd on (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:rsh: starting on n0 (node1.cluster.ird.nc): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.150.50 -P 37073 -n 0 -o 3"
> n-1<6372> ssi:boot:rsh: launching remotely
> n-1<6372> ssi:boot:rsh: attempting to execute "rsh node1.cluster.ird.nc -n echo $SHELL"
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> node1.cluster.ird.nc: Connection refused
> -----------------------------------------------------------------------------
> LAM failed to execute a process on the remote node
> "node1.cluster.ird.nc".
> LAM was not trying to invoke any LAM-specific commands yet --
> we were simply trying to determine what shell was being used
> on the remote host.
>
> LAM tried to use the remote agent command "rsh"
> to invoke "echo $SHELL" on the remote node.
>
> This usually indicates an authentication problem with the
> remote agent, or some other configuration type of error in
> your .cshrc or .profile file.  The following is a list of
> items that you may wish to check on the remote node:
>
>          - You have an account and can login to the remote machine
>          - Incorrect permissions on your home directory (should
>            probably be 0755)
>          - Incorrect permissions on your $HOME/.rhosts file
> (if you are
>            using rsh -- they should probably be 0644)
>          - You have an entry in the remote $HOME/.rhosts file (if you
>            are using rsh) for the machine and username that you are
>            running from
>          - Your .cshrc/.profile must not print anything out to the
>            standard error
>          - Your .cshrc/.profile should set a correct TERM type
>          - Your .cshrc/.profile should set the SHELL environment
>            variable to your default shell
>
> Try invoking the following command at the unix command line:
>
>          rsh node1.cluster.ird.nc -n echo $SHELL
>
> You will need to configure your local setup such that you
> will *not* be prompted for a password to invoke this command
> on the remote node.
> No output should be printed from the remote node before the
> output of the command is displayed.
>
> When you can get this command to execute successfully by
> hand, LAM will probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<6372> ssi:boot:base:linear: Failed to boot n0 (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:base:server: closing server socket
> n-1<6372> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot
> process, and will now attempt to kill all nodes that it was
> previously able to boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process,
> you may have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> lamboot: wipe -- nothing to do
> lamboot did NOT complete successfully
>
>
>






