pfilter shouldn't block LAM processes...  should it?
 
Cheers,
 
Bernard


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Salvatore Di Nardo
Sent: Wednesday, January 26, 2005 8:43
To: OSCAR
Subject: Re: [Oscar-users] lamboot will not start (OSCAR4 on FC2-i386)

OK, I found my mistake: the nodes had pfilter active.


On Wed, 2005-01-26 at 11:45, Salvatore Di Nardo wrote:
I successfully (I hope) installed OSCAR4 on FC2 (i386), and PBS is configured properly, but I have problems using lam and lamd.
If I try to start a LAM session

> lamboot my_hostfile

where my_hostfile contains:

"
node002 cpu=2 user=salvator
node003 cpu=2 user=salvator
oscarcluster cpu=2 user=salvator

"

I get this error:

"
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.2, port 32806.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

        1. 10.10.10.2 is not the correct IP address for the machine where the
           newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)

For number 1, check to ensure that 10.10.10.2 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.2 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 4, try to telnet to 10.10.10.2, port 32806.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
The lamboot agent failed to open a client socket to the newly-booted
process at IP address 10.10.10.3, port 32775.

Although the newly-booted process has already communicated
successfully with the lamboot agent over other TCP sockets, this is
the first time that the lamboot agent tried to initiate a connection
to the newly-booted process.  As such, this may indicate:

        1. 10.10.10.3 is not the correct IP address for the machine where the
           newly-booted machine was launched
        2. There are network filters between the lamboot agent host and
           the remote host such that communication on random TCP ports
           is blocked
        3. Network routing from the the local host to the remote isn't
           properly configured (this is unlikely)

For number 1, check to ensure that 10.10.10.3 is the correct IP address for
that machine.  If it is not, check the host mapping on that machine
(e.g., /etc/hosts) to ensure that 10.10.10.3 is both reachable and is the by
the host where the lamboot agent is running, and is the correct host.

For numbers 2 and 4, try to telnet to 10.10.10.3, port 32775.  You should get a
"connection refused" error, which will indicate that you successfully
connected to some machine at that IP address, and no process was
listening on that port.  If you get any other kind of error, check
with your system/network administrator -- it may indicate network /
routing issues between the two hosts.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).

Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------

"
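
The telnet check suggested in the lamboot message above can also be scripted. What follows is only a rough sketch, not part of LAM or OSCAR: the function name probe_tcp_port is made up for illustration, and the address and port in the usage comment are simply the ones from the error output (lamboot picks new ports on every run, so the port itself will normally have nothing listening on it). Per the message, "refused" means the host was reached and nothing was listening; a timeout or any other error may point to filtering or routing between the two hosts.

```python
import socket


def probe_tcp_port(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"        # something accepted the connection
    except ConnectionRefusedError:
        return "refused"     # host reachable, no listener on that port
    except socket.timeout:
        return "timeout"     # no answer at all within the timeout
    except OSError as exc:
        return "error: %s" % exc  # e.g. no route to host, name lookup failure
    finally:
        sock.close()


# Hypothetical usage, with the address and port taken from the error output:
#   print(probe_tcp_port("10.10.10.2", 32806))
```

Run it from the host where lamboot was started, once per node address reported in the error.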


note that this command:

> /usr/bin/ssh node003 -n -l salvator echo $SHELL

works properly without asking for a password, and I get this in response:

> /bin/bash

The same goes for the other nodes.
Any suggestions?


Salvatore Di Nardo
