Thanks again for your help.
We have tried a number of things but cannot seem to find what the
problem is. I will list a few of the things we have tried and some
other information to see if it helps point to anything...
Problem:
Cannot SSH from the compute node back to the head node, which causes several
problems.
General Stuff:
- The server can SSH into the compute node with no problem
- We can ping the server from the compute node
- The node mounts /home via NFS fine, and /home is accessible on the
compute node
- pbsnodes -a displays node information correctly about the node and
resources available
- You can see submitted jobs using qstat
- When you submit a job (either the test_cluster script or a user script
with qsub), it goes to the node, runs, and then returns to the "Q"
state. I assume that normally the node tells the head node that the job
is done, and the head node then sends a command to delete the job. No
output or error files are generated where the script was run. The jobs
can be deleted manually using qdel. I assume all of this is caused by
the failure of node->server SSH.
What we have tried:
- The stuff mentioned thus far in this thread
- Running the start_over script and reinstalling OSCAR with only one
active ethernet connection set to the internal network. The result is
the same.
- Changing the internal IP and hostname of the head node
- Checked the sshd_config and hosts files
- Other things I can't remember at the moment.
Does anyone have any other ideas why SSH might work in only one
direction? This is terribly frustrating, and it seems like no one else
has had a similar problem when setting up a cluster. I really am at a
loss as to what the cause could be.
Is there any reason why a driver problem would allow pings and NFS but
allow SSH in only one direction?
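One way to narrow this down from the compute node is to check whether the head node's sshd ever sends its identification line once the TCP connection is up. The following is a minimal Python sketch, not an OSCAR tool; the function name `probe_ssh_banner` and its return tags are illustrative assumptions.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or a short failure tag."""
    try:
        # create_connection applies the timeout to the connect and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(256)
    except socket.timeout:
        # Either the connect or the wait for the banner ran out of time.
        return "timeout"
    except OSError as exc:
        # Refused, unreachable, and similar socket-level failures.
        return "error: " + exc.__class__.__name__
    if not data:
        # The peer closed the connection without sending anything.
        return "closed"
    return data.decode("ascii", errors="replace").strip()
```

Calling `probe_ssh_banner("192.168.1.10")` on oscarnode01 and getting back a line starting with `SSH-` would suggest that the network path and sshd itself are responding, so the hang is more likely later in the session. A result of "timeout" would point back at the network or at the daemon.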
Hardware information:
Server: Q6600 on an eVGA nVidia motherboard (nVidia Ethernet and SCSI
controller)
Nodes: Dual E5345 on an ASUS Intel motherboard (Intel Ethernet and AHCI
SCSI controller)
(Is there any reason why the hardware differences could cause such a
problem? I already put a modprobe.conf file from a compute node that
had Linux installed into the image file prior to PXE boot; otherwise
the node would kernel panic on reboot after imaging.)
Thanks again for any help you can provide...
Sincerely,
Rob
Michael Edwards wrote:
Okay, so looking at the original oscarinstall.log and reading your
original message again, two things jump out at me. The first is that
in your /etc/hosts file you have one hostname mapping to two different
IP addresses. This may cause confusion.
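A quick way to spot this kind of clash is to scan the hosts file for names that appear under more than one address. The following is a minimal Python sketch, not part of OSCAR; the function name `names_with_multiple_addresses` is hypothetical, and `SAMPLE` is copied from the /etc/hosts file quoted in the original message of this thread.

```python
from collections import defaultdict


def names_with_multiple_addresses(hosts_text):
    """Return {name: [addresses]} for every hostname or alias listed under more than one address."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            if address not in seen[name]:
                seen[name].append(address)
    return {name: addrs for name, addrs in seen.items() if len(addrs) > 1}


# Entries taken from the /etc/hosts file posted in this thread.
SAMPLE = """\
127.0.0.1 localhost.localdomain localhost
192.168.0.1 pharos.mit.edu pharos oscar_server nfs_oscar pbs_oscar
18.80.7.242 pharos.mit.edu pharos
192.168.0.100 oscarnode01.mit.edu oscarnode01
"""

duplicates = names_with_multiple_addresses(SAMPLE)
```

For the sample, `duplicates` contains `pharos.mit.edu` and `pharos`, each listed under both 192.168.0.1 and 18.80.7.242. Which address wins at lookup time depends on the resolver, so this only flags the ambiguity; it does not say which entry is actually used.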
The other thing I notice is that OSCAR isn't seeing the hostname in
your /etc/hosts file at all, but a very long one instead that looks
like a DHCP assigned one.
Take a look at my suggestions here (
http://svn.oscar.openclustergroup.org/trac/oscar/wiki/TipTwoNetworkInterfaces)
and see if they make any sense.
Not sure what the issue with the new IP is but it seems like there was
some conflict with the old one since ssh is now at least trying to
connect.
On 10/26/07, Robert Ashcraft <[EMAIL PROTECTED]> wrote:
So, I changed the internal IP to 192.168.1.10 to prevent any would-be
conflicts, ran the start_over script, and went through the usual setup
process. Basically the same thing happened, but I got the iptables and
verbose SSH output. Maybe you can help make a little sense out of it.
Here is the iptables output:
=====
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
=====
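For longer listings, the same reading can be scripted. This is a rough Python sketch that assumes the plain `iptables -L` layout shown above; `summarize_iptables_listing` is a hypothetical helper name, not an existing tool. Keep in mind that `iptables -L` without `-t` lists only the filter table, so NAT or mangle rules would not show up in it.

```python
def summarize_iptables_listing(text):
    """Return {chain: {"policy": str or None, "rules": int}} from `iptables -L` output."""
    chains = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Chain "):
            parts = line.split()
            current = parts[1]
            # Built-in chains look like "Chain INPUT (policy ACCEPT)";
            # user-defined chains show a reference count instead of a policy.
            has_policy = len(parts) >= 4 and parts[2] == "(policy"
            policy = parts[3].rstrip(")") if has_policy else None
            chains[current] = {"policy": policy, "rules": 0}
        elif line and current and not line.startswith("target"):
            # Any non-header line under a chain is counted as one rule.
            chains[current]["rules"] += 1
    return chains
```

Fed the listing above (without the `=====` delimiters), this reports policy ACCEPT and zero rules for INPUT, FORWARD, and OUTPUT, which is consistent with the filter table not blocking anything on the machine where the command was run.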
Here is the SSH -vvv command when run from oscarnode01 trying to
get back into the head node:
=====
[EMAIL PROTECTED] ~]# ssh -vvv 192.168.1.10
OpenSSH_3.9p1, OpenSSL 0.9.7a Feb 19 2003
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to 192.168.1.10 [192.168.1.10] port 22.
debug1: Connection established.
debug1: permanently_set_uid: 0/0
debug1: identity file //root//.ssh/identity type 0
debug3: Not a RSA1 key file //root//.ssh/id_rsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_rsa type 1
debug3: Not a RSA1 key file //root//.ssh/id_dsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_dsa type 2
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_rsa type 1
debug3: Not a RSA1 key file //root//.ssh/id_dsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_dsa type 2
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_rsa type 1
debug3: Not a RSA1 key file //root//.ssh/id_dsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_dsa type 2
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_rsa type 1
debug3: Not a RSA1 key file //root//.ssh/id_dsa.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file //root//.ssh/id_dsa type 2
=====
Thanks again for your help.
Rob
Michael Edwards wrote:
OSCAR doesn't need a gateway on the head node to work. One-way
communication generally implies a firewall on the head node or some
other routing problem.
What do you get from "iptables -L" on the head node?
You might try using a different address for the head node than
192.168.0.1; that is a common default address for networking hardware
and can occasionally cause problems like this. I have become fond of
10.0.0.x because it isn't used as much.
You could also change the switch address, if that is the problem.
On 10/26/07, Robert Ashcraft <[EMAIL PROTECTED]> wrote:
Michael,
Thanks for the response. My colleague has tried those things and they
did not seem to help. The "ssh -vvv" command does not provide any
output and presumably just hangs somewhere in the connection process.
Just as some information: if we set up the compute node to connect via
DHCP over the external MIT network (not through the switch), I was able
to get two-way communication (through the MIT network). This seems to
imply that something is wrong with the static IP setup or something
related to the switch. However, the one-way communication is puzzling.
I don't think we have a gateway specified for the head node internal IP
address, only the IP (192.168.0.1) and subnet mask (255.255.255.0).
Could that be the source of any problems?
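Whether a gateway matters here depends on whether the two machines share a subnet: hosts on the same subnet talk to each other directly, and a gateway is only consulted for addresses outside it. The following Python sketch checks that with the standard `ipaddress` module; the helper name `same_subnet` is illustrative, and the addresses are the ones quoted in this thread.

```python
import ipaddress


def same_subnet(addr_a, addr_b, netmask):
    """Return True if addr_b falls inside the network that addr_a/netmask belongs to."""
    # strict=False lets us pass a host address rather than the network address.
    network = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    return ipaddress.ip_address(addr_b) in network


# Head node internal address vs. the compute node address from /etc/hosts.
head_and_node = same_subnet("192.168.0.1", "192.168.0.100", "255.255.255.0")
# Head node internal address vs. its external MIT address.
head_and_external = same_subnet("192.168.0.1", "18.80.7.242", "255.255.255.0")
```

Here `head_and_node` is True and `head_and_external` is False, so traffic between 192.168.0.1 and 192.168.0.100 should not need a gateway on the internal interface.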
We will continue to try to diagnose the problem, but any more
insight would be welcomed. Thanks,
Rob
Michael Edwards wrote:
Do you have the firewall on the head node turned off?
You can check by doing "iptables -L" or checking under the
"security level" utility.
You can also try doing "ssh -vvv [EMAIL PROTECTED] " and see if
it gives you any clues.
On 10/25/07, Robert Wilson Ashcraft <[EMAIL PROTECTED]> wrote:
Hi,
I am attempting to set up an OSCAR cluster. I have gotten through
everything past step 7, Complete Cluster Setup (which finished
successfully).
However, when I run the cluster tests, I get several failures, most
noticeably with the node -> server communication.
This is also confirmed by the fact that I can SSH to a node, but when I
am logged into the node, I cannot SSH back into the server (it just
hangs with no error message, but I can Ctrl-C out of it).
Do you have any idea why SSH from the client to the server would not be
working?
I have a feeling that if this problem is solved, the other failed tests
will work themselves out.
I am attaching the oscarinstall.log file in case that helps.
Here is my /etc/hosts file if that helps:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.0.1 pharos.mit.edu pharos oscar_server nfs_oscar pbs_oscar
18.80.7.242 pharos.mit.edu pharos
# These entries are managed by SIS, please don't modify them.
192.168.0.100 oscarnode01.mit.edu oscarnode01
Thanks for your help.
Rob Ashcraft
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX
and a browser.
Download your FREE copy of Splunk now >>
http://get.splunk.com/
_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
<mailto:Oscar-users@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/oscar-users
--
Robert W. Ashcraft
Ph.D. Candidate
Dept. Chemical Engineering
Massachusetts Institute of Technology
77 Massachusetts Ave.
Room 66-264
Cambridge, MA 02139
Phone: 617-253-6554
E-mail: [EMAIL PROTECTED]