Did you try to "telnet to 10.10.10.2, port 32806" as the help message suggests?

This is quite an unusual error (but not unheard of, otherwise we wouldn't have made such an extensive help message about it ;-) ).

Can you send the output of "lamboot -d my_hostfile"? This includes a *lot* more verbosity and is helpful for checking what is going wrong in situations like this.
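
If telnet is not installed on the head node, the same test can be approximated with a few lines of Python. This is only a rough sketch and not part of LAM: the function name probe_tcp and the 5-second timeout are my own assumptions. It attempts a TCP connection and reports whether it was refused (the result the help message expects), timed out, or failed some other way.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection and classify the result.

    Hypothetical helper (not part of LAM) that mimics the
    "telnet <ip> <port>" test suggested by the lamboot help message.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The remote machine answered, but nothing listens on that port
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

Running probe_tcp("10.10.10.2", 32806) and probe_tcp("10.10.10.3", 32775) from the machine where lamboot was started should return "refused" if the network path is fine. A "timeout" usually means the packets are being silently dropped somewhere, for example by a packet filter on the node, and an "error: ..." result points more toward addressing or routing.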



On Jan 26, 2005, at 2:45 AM, Salvatore Di Nardo wrote:

I successfully (I hope) installed OSCAR4 on FC2 (i386), and PBS is configured properly, but I have problems using lam and lamd.
If I try to start a LAM session:


> lamboot my_hostfile

where my_hostfile contains:

"
node002 cpu=2 user=salvator
node003 cpu=2 user=salvator
oscarcluster cpu=2 user=salvator
"

I get this error:

"
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.2, port 32806.


Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

        1. 10.10.10.2 is not the correct IP address for the machine where the
           newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)


For number 1, check to ensure that 10.10.10.2 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.2 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.


For numbers 2 and 4, try to telnet to 10.10.10.2, port 32806.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.3, port 32775.


Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

        1. 10.10.10.3 is not the correct IP address for the machine where the
           newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)


For number 1, check to ensure that 10.10.10.3 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.3 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.


For numbers 2 and 4, try to telnet to 10.10.10.3, port 32775.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).


Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
"



Note that this command:

> /usr/bin/ssh node003 -n -l salvator echo $SHELL

works properly without asking for a password, and I get in answer:

> /bin/bash

The same goes for the other nodes.
Any suggestions?


Salvatore Di Nardo

-- {+} Jeff Squyres {+} [EMAIL PROTECTED] {+} http://www.lam-mpi.org/



_______________________________________________
Oscar-users mailing list
https://lists.sourceforge.net/lists/listinfo/oscar-users
