> Can you describe the problems that you were having with ssh? Someone
> here on the oscar-users list may be able to help.
The problem was:
I followed the steps on page 16 of the manual "MPI Primer / Developing
With LAM". To show you the problem I had, I removed my fix from the
file on node 2 ("node2.nuevo-cluster").
Then I ran the following command as a regular user:
recon -v lamhosts
And I got:
*******************************************
recon: -- testing n0 (malambo)
recon: -- testing n1 (node2.nuevo-cluster)
bash: tkill: command not found
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "node2.nuevo-cluster".
Since LAM was already able to determine your remote shell as "tkill",
it is probable that this is not an authentication problem.
LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke the following command:
/usr/bin/ssh node2.nuevo-cluster -n tkill -N
This can indicate several things. You should check the following:
- The LAM binaries are in your $PATH
- You can run the LAM binaries
- The $PATH variable is set properly before your
.cshrc/.profile exits
Try to invoke the command listed above manually at a Unix prompt.
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
recon: "node2.nuevo-cluster" cannot be booted.
recon: Unknown error 127
-----------------------------------------------------------------------------
recon was not able to complete successfully. There can be any number
of problems that did not allow recon to work properly. You should use
the "-d" option to recon to get more information about each step that
recon attempts.
Any error message above may present a more detailed description of the
actual problem.
Here is general a list of prerequesites that *must* be fullfilled
before recon can work:
- Each machine in the hostfile must be reachable and operational.
- You must have an account on each machine.
- You must be able to rsh(1) to the machine (permissions
are typically set in the user's $HOME/.rhosts file).
*** Sidenote: If you compiled LAM to use a remote shell program
other than rsh (with the --with-rsh option to ./configure;
e.g., ssh), or if you set the LAMRSH environment variable
to an alternate remote shell program, you need to ensure
that you can execute programs on remote nodes with no
password. For example:
unix% ssh -x pinky uptime
3:09am up 211 day(s), 23:49, 2 users, load average: 0.01, 0.08, 0.10
- The LAM executables must be locatable on each machine, using
the shell's search path and possibly the LAMHOME environment
variable.
- The shell's start-up script must not print anything on standard
error. You can take advantage of the fact that rsh(1) will
start the shell non-interactively. The start-up script (such
as .profile or .cshrc) can exit early in this case, before
executing many commands relevant only to interactive sessions
and likely to generate output.
----------------------------------------------------------------------------
**********************************
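By the way, the "Unknown error 127" that recon reports above is just the shell's exit status for a missing command, which matches the "tkill: command not found" line. A quick local check (the command name here is a deliberately nonexistent placeholder):

```shell
# Exit status 127 is the shell's "command not found" status --
# the same code recon reports above as "Unknown error 127".
bash -c 'no-such-command-xyz' 2>/dev/null
echo "exit status: $?"
# prints "exit status: 127"
```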
I did some tests with ssh:
ssh node2.nuevo-cluster 'ls'
The command above works; with standard commands like "ls", "ps", or
"grep", ssh executes them fine. But with other commands, for example:
ssh node2.nuevo-cluster 'mpirun -np 2 namefile'
you get:
bash: mpirun: command not found
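This "command not found" can be simulated locally without ssh, to confirm it is purely a PATH issue on the remote side (the `lamdemo` script and temp directory below are stand-ins invented for this sketch, not part of LAM):

```shell
# Create a small script in a directory that is NOT on PATH.
dir=$(mktemp -d)
printf '#!/bin/sh\necho "demo ran"\n' > "$dir/lamdemo"
chmod +x "$dir/lamdemo"

# Not on PATH yet: fails the same way mpirun did over ssh.
bash -c 'lamdemo' 2>/dev/null || echo "lamdemo: not found"

# Prepend its directory to PATH, as the /etc/bashrc fix does for
# LAM's bin directory, and the command is found.
PATH="$dir:$PATH" bash -c 'lamdemo'
# prints "demo ran"
```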
As root, I fixed the problem by adding the following line to
/etc/bashrc:
export PATH=/opt/lam-6.5.6/bin:$PATH
I added this line on all the nodes except the server.
I don't know whether this is the right method, but I can now run
LAM/MPI programs on the cluster.
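For reference, here is the fix with the exact shell syntax. Note there must be no space after the '='; also, /opt/lam-6.5.6/bin is the usual location for LAM 6.5.6 under OSCAR, so adjust it if your install lives elsewhere:

```shell
# Prepend LAM's bin directory (assumed location; adjust to your install).
# No space after '=' -- with a space, "export PATH= /path" would set
# PATH to the empty string and reject the path as an invalid identifier.
export PATH=/opt/lam-6.5.6/bin:$PATH

# Quick sanity check that the directory is now on PATH.
echo "$PATH" | grep -q '/opt/lam-6.5.6/bin' && echo "PATH updated"
```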
Daniel Burbano
ciao
> On Sat, 15 Jun 2002 [EMAIL PROTECTED] wrote:
>
>> > I should note that if you are using PBS (which I see from below that
>> > you are not), PBS will create machinefiles for you. These should be
>> > used instead of the default machines.LINUX file for PBS runs.
>>
>> I saw that OSCAR did not write the number of processors of each
>> node in machines.LINUX. This is the standard setup when you
>> install mpich.
>
> I believe that this was a bug that was discovered in OSCAR 1.2.1, and I
> think it has been fixed for 1.3.
>
>> > - OSCAR clusters are set up such that ssh should "just work" for the
>> > users. You shouldn't need to setup any keys or anything like that --
>> > OSCAR should have done that for you.
>>
>> I have to, because ssh could not execute remote commands from the
>> server to the nodes. And I had problems to execute MPI programs with
>> LAM.
>
> Can you describe the problems that you were having with ssh? Someone
> here on the oscar-users list may be able to help.
>
> This may have impacted your ability to run LAM, because OSCAR's LAM
> uses ssh by default.
>
> {+} Jeff Squyres
> {+} [EMAIL PROTECTED]
> {+} http://www.lam-mpi.org/
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users