Hi,
Now my default MPI propagates across the cluster. It did not before because my home directory was not mounted; I think $HOME was not mounted because I booted the nodes before the cluster head. Running "cexec mount -a" fixed it, and "cexec switcher mpi" now shows my new default MPI, "Lam-oscar-7.0-ifort".
To test my LAM configuration, I edited Lam-7.0-ifort/etc/lam-bhost.def by hand on my front-end, listing my nodes and the front-end itself.
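For reference, a minimal lam-bhost.def matching the hosts in the lamboot output below might look like this (this is a sketch from the resolved host list, not my actual file):

```
node1.cluster.ird.nc cpu=2
node2.cluster.ird.nc cpu=2
node3.cluster.ird.nc cpu=2
editr.cluster.ird.nc cpu=2
```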
However, when I run "lamboot -v", LAM complains: n-1<6228> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc) ERROR: node1.cluster.ird.nc: Connection refused
The lines that follow talk about rsh...
I have a doubt: when I configured LAM 7.0.6 from source, did I omit to pass configure --with-rsh="ssh -x"? Does that matter?
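If ssh was not selected at configure time, LAM falls back to rsh, which matches the "rsh ... Connection refused" in the log (most likely no rshd is running on the nodes). A sketch of the two ways to switch to ssh, assuming the install prefix /opt/lam-7.0-ifort from the log:

```shell
# Option 1: rebuild LAM with ssh as the remote agent
./configure --prefix=/opt/lam-7.0-ifort --with-rsh="ssh -x"
make && make install

# Option 2: without rebuilding, override the remote agent at runtime
# via the LAMRSH environment variable before booting
export LAMRSH="ssh -x"
lamboot -v /opt/lam-7.0-ifort/etc/lam-bhost.def
```

Either way, ssh must work password-free from the front-end to every node for lamboot to succeed.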
See my output from "lamboot -d" below.
Many thanks
jerome
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<6372> ssi:boot:base: looking for boot schema in following directories:
n-1<6372> ssi:boot:base: <current directory>
n-1<6372> ssi:boot:base: $TROLLIUSHOME/etc
n-1<6372> ssi:boot:base: $LAMHOME/etc
n-1<6372> ssi:boot:base: /opt/lam-7.0-ifort/etc
n-1<6372> ssi:boot:base: looking for boot schema file:
n-1<6372> ssi:boot:base: lam-bhost.def
n-1<6372> ssi:boot:base: found boot schema: /opt/lam-7.0-ifort/etc/lam-bhost.def
n-1<6372> ssi:boot:rsh: found the following hosts:
n-1<6372> ssi:boot:rsh: n0 node1.cluster.ird.nc (cpu=2)
n-1<6372> ssi:boot:rsh: n1 node2.cluster.ird.nc (cpu=2)
n-1<6372> ssi:boot:rsh: n2 node3.cluster.ird.nc (cpu=2)
n-1<6372> ssi:boot:rsh: n3 editr.cluster.ird.nc (cpu=2)
n-1<6372> ssi:boot:rsh: resolved hosts:
n-1<6372> ssi:boot:rsh: n0 node1.cluster.ird.nc --> 192.168.150.1
n-1<6372> ssi:boot:rsh: n1 node2.cluster.ird.nc --> 192.168.150.2
n-1<6372> ssi:boot:rsh: n2 node3.cluster.ird.nc --> 192.168.150.3
n-1<6372> ssi:boot:rsh: n3 editr.cluster.ird.nc --> 192.168.150.50 (origin)
n-1<6372> ssi:boot:rsh: starting RTE procs
n-1<6372> ssi:boot:base:linear: starting
n-1<6372> ssi:boot:base:server: opening server TCP socket
n-1<6372> ssi:boot:base:server: opened port 37073
n-1<6372> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
n-1<6372> ssi:boot:rsh: starting lamd on (node1.cluster.ird.nc)
n-1<6372> ssi:boot:rsh: starting on n0 (node1.cluster.ird.nc): hboot -t -c lam-conf.lamd -d -s -I "-H 192.168.150.50 -P 37073 -n 0 -o 3"
n-1<6372> ssi:boot:rsh: launching remotely
n-1<6372> ssi:boot:rsh: attempting to execute "rsh node1.cluster.ird.nc -n echo $SHELL"
ERROR: LAM/MPI unexpectedly received the following on stderr:
node1.cluster.ird.nc: Connection refused
-----------------------------------------------------------------------------
LAM failed to execute a process on the remote node "node1.cluster.ird.nc".
LAM was not trying to invoke any LAM-specific commands yet -- we were
simply trying to determine what shell was being used on the remote
host.
LAM tried to use the remote agent command "rsh" to invoke "echo $SHELL" on the remote node.
This usually indicates an authentication problem with the remote agent, or some other configuration type of error in your .cshrc or .profile file. The following is a list of items that you may wish to check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
  variable to your default shell

Try invoking the following command at the unix command line:
rsh node1.cluster.ird.nc -n echo $SHELL
You will need to configure your local setup such that you will *not* be prompted for a password to invoke this command on the remote node. No output should be printed from the remote node before the output of the command is displayed.
When you can get this command to execute successfully by hand, LAM will probably be able to function properly.
-----------------------------------------------------------------------------
n-1<6372> ssi:boot:base:linear: Failed to boot n0 (node1.cluster.ird.nc)
n-1<6372> ssi:boot:base:server: closing server socket
n-1<6372> ssi:boot:base:linear: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process, and will now attempt to kill all nodes that it was previously able to boot (if any).
Please wait for LAM to finish; if you interrupt this process, you may have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
lamboot: wipe -- nothing to do
lamboot did NOT complete successfully
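Assuming home directories are NFS-shared across the cluster (as the mount problem above suggests), password-free ssh can be set up in one place and then verified with the same test LAM performs. A sketch (key type and filenames are the common defaults, not anything LAM requires):

```shell
# Generate a key pair with no passphrase, if one does not already exist
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# With an NFS-shared $HOME, authorizing your own key covers every node
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# This must now succeed with no password prompt and no extra output,
# mirroring the probe lamboot runs:
ssh -x node1.cluster.ird.nc -n 'echo $SHELL'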
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users
