Hi Bernard,
Yes, with export LAMRSH='ssh -x', everything works. I will rebuild my LAM-ifort with this option. Many thanks for your tip.
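For the rebuild, I think the configure line will be something like this (the prefix is just where my current LAM-ifort install lives; please tell me if another option is needed):

    cd lam-7.0.6
    ./configure --prefix=/opt/lam-7.0-ifort --with-rsh="ssh -x"
    make
    make install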
Now, if I run "qsub mpihello.pbs", $PBS_NODEFILE lists only 2 nodes:
node3.cluster.ird.nc
node2.cluster.ird.nc
This happens even if I put the "N" flag after mpirun in my PBS script.
Yet if I run lamboot and "mpirun -np 8 mpihello" by hand, all 4 of my nodes compute?!
I think PBS is not giving me the correct node list.
With OSCAR, how does PBS build $PBS_NODEFILE? How can I correct it?
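Should my mpihello.pbs look something like this? (The resource request is just my guess for 4 dual-CPU nodes, and the lamboot/mpirun lines are only a sketch of what I am trying to do.)

    #!/bin/sh
    #PBS -N mpihello
    #PBS -l nodes=4:ppn=2          # guess: ask PBS for 4 nodes, 2 CPUs each
    cd $PBS_O_WORKDIR
    lamboot $PBS_NODEFILE          # boot LAM only on the nodes PBS allocated
    mpirun N ./mpihello            # run on every booted node
    lamhalt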
Many thanks
Ciao. Jérome
At 15:19 21/03/2005 -0800, Bernard Li wrote:
Hi Jerome:
You can either:
1) Turn on rsh on the cluster nodes (this is turned off by default on an OSCAR cluster)
2) Use ssh instead of rsh... export LAMRSH='ssh -x' (I think that's the correct environment variable).
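As a quick sanity check that option 2 will work, something like this from the headnode should print the remote shell without asking for a password (node1.cluster.ird.nc is just the example hostname from your log):

    export LAMRSH='ssh -x'
    ssh -x node1.cluster.ird.nc 'echo $SHELL'   # should print e.g. /bin/bash, with no password prompt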
Cheers,
Bernard
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Lefevre Jerome
> Sent: Monday, March 21, 2005 15:05
> To: [email protected]
> Subject: [Oscar-users] RE: lamboot not found
>
> Hi,
>
> Now my default MPI propagates across the cluster; it was failing before
> because my home was not mounted. I think my home directory was not mounted
> because I booted the nodes first, before the cluster. "cexec mount -a" and
> all is right now. "cexec switcher mpi" shows me my new default MPI
> "Lam-oscar-7.0-ifort".
>
> To test my LAM configuration, I edited Lam-7.0-ifort/etc/lam-bhost.def by
> hand on my front-end, listing my nodes and the front-end.
>
> However, if I type lamboot -v, LAM complains about:
> n-1<6228> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
> ERROR : node1.cluster.ird.nc : connection refused
>
> What follows talks about rsh...
>
> I have a doubt: when I configured Lam-7.0.6 from source, did I forget to
> specify configure --with-rsh="ssh -x"? Does it matter?
>
> See below my output from "lamboot -d".
>
> Many thanks
>
> jerome
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<6372> ssi:boot:base: looking for boot schema in following directories:
> n-1<6372> ssi:boot:base: <current directory>
> n-1<6372> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<6372> ssi:boot:base: $LAMHOME/etc
> n-1<6372> ssi:boot:base: /opt/lam-7.0-ifort/etc
> n-1<6372> ssi:boot:base: looking for boot schema file:
> n-1<6372> ssi:boot:base: lam-bhost.def
> n-1<6372> ssi:boot:base: found boot schema: /opt/lam-7.0-ifort/etc/lam-bhost.def
> n-1<6372> ssi:boot:rsh: found the following hosts:
> n-1<6372> ssi:boot:rsh: n0 node1.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh: n1 node2.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh: n2 node3.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh: n3 editr.cluster.ird.nc (cpu=2)
> n-1<6372> ssi:boot:rsh: resolved hosts:
> n-1<6372> ssi:boot:rsh: n0 node1.cluster.ird.nc --> 192.168.150.1
> n-1<6372> ssi:boot:rsh: n1 node2.cluster.ird.nc --> 192.168.150.2
> n-1<6372> ssi:boot:rsh: n2 node3.cluster.ird.nc --> 192.168.150.3
> n-1<6372> ssi:boot:rsh: n3 editr.cluster.ird.nc --> 192.168.150.50 (origin)
> n-1<6372> ssi:boot:rsh: starting RTE procs
> n-1<6372> ssi:boot:base:linear: starting
> n-1<6372> ssi:boot:base:server: opening server TCP socket
> n-1<6372> ssi:boot:base:server: opened port 37073
> n-1<6372> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:rsh: starting lamd on (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:rsh: starting on n0 (node1.cluster.ird.nc): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.150.50 -P 37073 -n 0 -o 3"
> n-1<6372> ssi:boot:rsh: launching remotely
> n-1<6372> ssi:boot:rsh: attempting to execute "rsh node1.cluster.ird.nc -n echo $SHELL"
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> node1.cluster.ird.nc: Connection refused
> -----------------------------------------------------------------------------
> LAM failed to execute a process on the remote node "node1.cluster.ird.nc".
> LAM was not trying to invoke any LAM-specific commands yet -- we were
> simply trying to determine what shell was being used on the remote host.
>
> LAM tried to use the remote agent command "rsh"
> to invoke "echo $SHELL" on the remote node.
>
> This usually indicates an authentication problem with the remote agent,
> or some other configuration type of error in your .cshrc or .profile file.
> The following is a list of items that you may wish to check on the remote node:
>
> - You have an account and can login to the remote machine
> - Incorrect permissions on your home directory (should probably be 0755)
> - Incorrect permissions on your $HOME/.rhosts file (if you are using rsh
>   -- they should probably be 0644)
> - You have an entry in the remote $HOME/.rhosts file (if you are using
>   rsh) for the machine and username that you are running from
> - Your .cshrc/.profile must not print anything out to the standard error
> - Your .cshrc/.profile should set a correct TERM type
> - Your .cshrc/.profile should set the SHELL environment variable to your
>   default shell
>
> Try invoking the following command at the unix command line:
>
>   rsh node1.cluster.ird.nc -n echo $SHELL
>
> You will need to configure your local setup such that you will *not* be
> prompted for a password to invoke this command on the remote node.
> No output should be printed from the remote node before the output of the
> command is displayed.
>
> When you can get this command to execute successfully by hand, LAM will
> probably be able to function properly.
> -----------------------------------------------------------------------------
> n-1<6372> ssi:boot:base:linear: Failed to boot n0 (node1.cluster.ird.nc)
> n-1<6372> ssi:boot:base:server: closing server socket
> n-1<6372> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process, and
> will now attempt to kill all nodes that it was previously able to boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> lamboot: wipe -- nothing to do
> lamboot did NOT complete successfully
