No, it should not. pfilter is supposed to set itself up to allow
unrestricted TCP and UDP access between all nodes of the OSCAR cluster
(but nowhere else).
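If you want to check whether a filter is still getting in the way, the
test that the lamboot message suggests (telnet to the node's IP address
and port) can also be scripted. What follows is only a minimal Python
sketch and not part of OSCAR or LAM: the function name probe_tcp and
its result labels are made up for illustration, and the reading of a
timeout as "probably filtered" is an assumption that holds for
firewalls that silently drop packets.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection to host:port and classify the outcome.

    Hypothetical helper (not an OSCAR or LAM tool). Returns one of:
      "open"        - something accepted the connection
      "refused"     - the host answered, but nothing listens on the port
      "filtered"    - no answer before the timeout (packets likely dropped)
      "unreachable" - the OS reported no route to the host or network
      "error: ..."  - any other socket error, with the OS message
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # This is the "connection refused" result the lamboot text expects
        # when the path to the node is clear.
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        return "error: %s" % exc


# Example call, using an address and port taken from the error output:
#   print(probe_tcp("10.10.10.2", 32806))
```

A "refused" result means the node was reached and nothing was listening
on that port, which is what the lamboot message says to expect. A
"filtered" or "unreachable" result points toward a firewall or routing
problem between the two hosts.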
On Jan 26, 2005, at 9:56 AM, Bernard Li wrote:
pfilter shouldn't block LAM processes... should it?

Cheers,

Bernard
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
Salvatore Di Nardo
Sent: Wednesday, January 26, 2005 8:43
To: OSCAR
Subject: Re: [Oscar-users] lamboot will not start (OSCAR4 on FC2-i386)
OK, I found my mistake: the nodes had pfilter active.
On Wed, 2005-01-26 at 11:45, Salvatore Di Nardo wrote:
I successfully (I hope) installed OSCAR4 on FC2 (i386), and PBS is
configured properly, but I have problems using lam and lamd.
If I try to start a LAM session
> lamboot my_hostfile
where my_hostfile contains:
"
node002 cpu=2 user=salvator
node003 cpu=2 user=salvator
oscarcluster cpu=2 user=salvator
"
I get this error:
"
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.2, port 32806.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

       1. 10.10.10.2 is not the correct IP address for the machine where the
          newly-booted machine was launched
       2. There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
       3. Network routing from the the local host to the remote isn't
          properly configured (this is unlikely)

For number 1, check to ensure that 10.10.10.2 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.2 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 4, try to telnet to 10.10.10.2, port 32806.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.3, port 32775.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

       1. 10.10.10.3 is not the correct IP address for the machine where the
          newly-booted machine was launched
       2. There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
       3. Network routing from the the local host to the remote isn't
          properly configured (this is unlikely)

For number 1, check to ensure that 10.10.10.3 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.3 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 4, try to telnet to 10.10.10.3, port 32775.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
"
Note that this command:
> /usr/bin/ssh node003 -n -l salvator echo $SHELL
works properly without asking for a password, and I get this answer:
> /bin/bash
The same is true for the other nodes.
Any suggestions?
Salvatore Di Nardo
--
{+} Jeff Squyres
{+} [EMAIL PROTECTED]
{+} http://www.lam-mpi.org/