Thanks again for your help. We have tried a number of things but cannot seem to find what the problem is. I will list a few of the things we have tried and some other information to see if it helps point to anything...

Problem:
Cannot SSH from the compute node back to the head node, which causes several problems.

General Stuff:
- The server can SSH into the compute node with no problem
- We can ping the server from the compute node
- The node mounts /home via NFS fine, and /home is accessible on the compute node
- pbsnodes -a displays node information correctly about the node and resources available
- You can see submitted jobs using qstat
- When you submit a job (either the test_cluster script or a user script with qsub), it goes to the node, runs, and then returns to the "Q" state. I assume that normally the node tells the head node that the job is done, and the head node then sends a command to delete the job. No output or error files are generated where the script was run. The jobs can be deleted manually using qdel. I assume this is caused by the failure of node->server SSH.


What we have tried:
- The stuff mentioned thus far in this thread
- Running the start_over script and reinstalling OSCAR with only one active ethernet connection set to the internal network. The result is the same.
- Changing the internal IP and hostname of the head node
- Checking the sshd_config and hosts files
- Other things I can't remember at the moment.


Does anyone have any other ideas why SSH would work in only one direction? This is terribly frustrating, and it seems like no one has had a similar problem when setting up a cluster. I really am at a loss as to what the cause could be. Is there any reason why a driver problem would allow pings and NFS but only a one-way SSH connection?
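One way to narrow down where the node-to-head connection stalls is to separate the TCP handshake from the SSH banner exchange. Below is a minimal Python sketch, not something taken from this thread: `probe_ssh_banner` and its stage labels are hypothetical names, and it assumes the head node's sshd sends its version banner as soon as the connection is accepted.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Report how far a connection to an SSH port gets.

    Returns a (stage, detail) tuple. The stage names are illustrative
    labels chosen for this sketch, not part of any SSH tooling.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return ("connect_timeout", "no TCP handshake within the timeout")
    except ConnectionRefusedError:
        return ("refused", "host answered but nothing is listening on the port")
    except OSError as exc:
        return ("connect_error", str(exc))

    with sock:
        sock.settimeout(timeout)
        try:
            # An SSH server normally sends its version line first.
            banner = sock.recv(256)
        except socket.timeout:
            return ("banner_timeout", "TCP connected but no banner arrived")
        except OSError as exc:
            return ("read_error", str(exc))

    if not banner:
        return ("closed", "server closed the connection without a banner")
    return ("banner", banner.decode("ascii", errors="replace").strip())
```

Run from the compute node, for example as `print(probe_ssh_banner("192.168.1.10"))`. A `connect_timeout` would suggest packets to port 22 are being dropped somewhere, while a `banner_timeout` would suggest the TCP connection succeeds and the stall happens on the server side afterwards.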

Hardware information:
Server: Q6600 on an eVGA nVidia motherboard (nVidia ethernet and SCSI controller)
Nodes: Dual E5345 on an ASUS Intel motherboard (Intel ethernet and ahci SCSI controller)

Is there any reason why the hardware differences could cause such a problem? I already put a modprobe.conf file from a compute node installed with Linux into the image file prior to PXE boot; otherwise it would kernel panic on reboot after imaging.

Thanks again for any help you can provide...

Sincerely,

Rob



Michael Edwards wrote:
Okay, so looking at the original oscarinstall.log and reading your original message again, two things jump out at me. The first is that in your /etc/hosts file you have one hostname mapping to two different IP addresses. This may cause confusion.
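To make the duplicate-mapping point concrete, here is a small Python sketch that scans hosts-file text for names listed under more than one address. The function name `names_with_multiple_addresses` is made up for illustration, and the sample text is the /etc/hosts content from the original message in this thread.

```python
from collections import defaultdict


def names_with_multiple_addresses(hosts_text):
    """Return {name: [ip, ...]} for names that appear under more than one IP."""
    addresses = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            if ip not in addresses[name]:
                addresses[name].append(ip)
    return {name: ips for name, ips in addresses.items() if len(ips) > 1}


SAMPLE_HOSTS = """\
127.0.0.1       localhost.localdomain localhost
192.168.0.1     pharos.mit.edu pharos oscar_server nfs_oscar pbs_oscar
18.80.7.242     pharos.mit.edu pharos
192.168.0.100   oscarnode01.mit.edu oscarnode01
"""

if __name__ == "__main__":
    for name, ips in names_with_multiple_addresses(SAMPLE_HOSTS).items():
        print(name, "->", ", ".join(ips))
```

On the sample, both pharos.mit.edu and pharos are reported with the internal and the external address, which is the ambiguity described above.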

The other thing I notice is that OSCAR isn't seeing the hostname in your /etc/hosts file at all, but a very long one instead that looks like a DHCP assigned one.
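A quick way to see which name the machine itself reports, and what that name resolves to, is something like the Python sketch below. It is only an illustration (`hostname_report` is a hypothetical helper, and I am not claiming OSCAR determines the hostname this way), but a long DHCP-style name showing up here would be consistent with what the log shows.

```python
import socket


def hostname_report():
    """Collect the local hostname and the addresses it resolves to."""
    name = socket.gethostname()
    report = {"hostname": name, "addresses": []}
    try:
        # Returns (canonical_name, alias_list, ip_address_list).
        canonical, aliases, addresses = socket.gethostbyname_ex(name)
        report["canonical"] = canonical
        report["aliases"] = aliases
        report["addresses"] = addresses
    except OSError as exc:
        # Resolution failures (gaierror/herror) are subclasses of OSError.
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in hostname_report().items():
        print(f"{key}: {value}")
```

Running it on the head node shows whether the reported name matches the entry you expect from /etc/hosts and whether it resolves to the internal or the external address.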

Take a look at my suggestions here (http://svn.oscar.openclustergroup.org/trac/oscar/wiki/TipTwoNetworkInterfaces) and see if they make any sense.

Not sure what the issue with the new IP is but it seems like there was some conflict with the old one since ssh is now at least trying to connect.

On 10/26/07, *Robert Ashcraft* <[EMAIL PROTECTED]> wrote:

    So, I changed the internal IP to 192.168.1.10 to prevent any
    would-be conflicts, ran the start_over script, and went through
    the usual setup process.

    Basically the same thing happened, but I got the iptables and
    verbose SSH output.  Maybe you can help make a little sense out of
    it.

    Here is the iptables output:
    =====

    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination

    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination

    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination

    =====
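The listing above shows three empty chains with an ACCEPT policy, so the filter table on its own is not blocking anything (note that plain `iptables -L` lists only the filter table). As a rough sketch, the same check can be done mechanically in Python on saved output; `summarize_filter_table` and `is_wide_open` are hypothetical helper names, and the parser assumes the plain `iptables -L` layout shown here.

```python
import re

# Matches "Chain INPUT (policy ACCEPT)" and "Chain NAME (2 references)".
CHAIN_HEADER = re.compile(r"^Chain (\S+) \((?:policy (\w+)|\d+ references)")


def summarize_filter_table(listing):
    """Return {chain: {"policy": str or None, "rules": int}} from `iptables -L` text."""
    chains = {}
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        header = CHAIN_HEADER.match(line)
        if header:
            current = header.group(1)
            chains[current] = {"policy": header.group(2), "rules": 0}
        elif line and not line.startswith("target") and current is not None:
            # Any other non-empty line under a chain is counted as a rule.
            chains[current]["rules"] += 1
    return chains


def is_wide_open(chains):
    """True when every chain is empty and no built-in chain has a non-ACCEPT policy."""
    return bool(chains) and all(
        info["rules"] == 0 and info["policy"] in ("ACCEPT", None)
        for info in chains.values()
    )


LISTING = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
"""

if __name__ == "__main__":
    print(is_wide_open(summarize_filter_table(LISTING)))
```

A result of True for the head node's listing would mean the packet filter is unlikely to be what stops the incoming SSH connection, so the cause would have to be looked for elsewhere.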

    Here is the output of ssh -vvv when run from oscarnode01 trying
    to get back into the head node:

    =====

    [EMAIL PROTECTED] ~]# ssh -vvv 192.168.1.10
    OpenSSH_3.9p1, OpenSSL 0.9.7a Feb 19 2003
    debug1: Reading configuration data /etc/ssh/ssh_config
    debug1: Applying options for *
    debug2: ssh_connect: needpriv 0
    debug1: Connecting to 192.168.1.10 [192.168.1.10] port 22.
    debug1: Connection established.
    debug1: permanently_set_uid: 0/0
    debug1: identity file //root//.ssh/identity type 0
    debug3: Not a RSA1 key file //root//.ssh/id_rsa.
    debug2: key_type_from_name: unknown key type '-----BEGIN'
    debug3: key_read: missing keytype
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace

    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace

    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug2: key_type_from_name: unknown key type '-----END'

    debug3: key_read: missing keytype
    debug1: identity file //root//.ssh/id_rsa type 1
    debug3: Not a RSA1 key file //root//.ssh/id_dsa.
    debug2: key_type_from_name: unknown key type '-----BEGIN'

    debug3: key_read: missing keytype
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace

    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug3: key_read: missing whitespace
    debug2: key_type_from_name: unknown key type '-----END'

    debug3: key_read: missing keytype
    debug1: identity file //root//.ssh/id_dsa type 2


    =====

    Thanks again for your help.

    Rob



    Michael Edwards wrote:
    OSCAR doesn't need a gateway on the head node to work.  One way
    communication generally implies there is a firewall on the head
    node or other routing problem.
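As a quick sanity check on the gateway point: with the addresses quoted in this thread, the head node (192.168.0.1, mask 255.255.255.0) and the compute node (192.168.0.100) sit on the same subnet, so traffic between them is delivered directly and no gateway is involved. A small Python sketch of that check, with `on_same_subnet` as an illustrative helper name:

```python
import ipaddress


def on_same_subnet(local_ip, netmask, peer_ip):
    """True if peer_ip falls inside the network defined by local_ip/netmask."""
    # strict=False lets us pass a host address rather than the network address.
    network = ipaddress.ip_network(f"{local_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(peer_ip) in network


if __name__ == "__main__":
    # Head node internal interface vs. the compute node from /etc/hosts.
    print(on_same_subnet("192.168.0.1", "255.255.255.0", "192.168.0.100"))
    # Head node internal interface vs. its own external address.
    print(on_same_subnet("192.168.0.1", "255.255.255.0", "18.80.7.242"))
```

Only peers outside the local network, such as the external 18.80.7.242 address, would need a route through a gateway.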

    What do you get from "iptables -L" on the head node?

    You might try using a different address for the head node than
    192.168.0.1; that is a common default address for networking
    hardware and can cause problems like this occasionally.  I have
    become fond of 10.0.0.x because it isn't used as much.

    You could also change the switch address too, if that is the problem.

    On 10/26/07, *Robert Ashcraft* <[EMAIL PROTECTED]> wrote:

        Michael,

        Thanks for the response.  My colleague has tried those things
        and they did not seem to help.  The "ssh -vvv" command does
        not provide any output and presumably just hangs somewhere in
        the connection process.

        Just as some information...  If we set up the compute node to
        connect via DHCP over the external MIT network (not through
        the switch), I was able to get two-way communication (through
        the MIT network).  This seems to imply that something is wrong
        with the static IP setup or something related to the switch.
        However, the one-way communication is puzzling.

        I don't think we have a gateway specified for the head node's
        internal IP address, only the IP (192.168.0.1) and subnet mask
        (255.255.255.0).  Could that be the source of any problems?

        We will continue to try to diagnose the problem, but any more
        insight would be welcomed.  Thanks,

        Rob


        Michael Edwards wrote:
        Do you have the firewall on the head node turned off?

        You can check by doing "iptables -L" or checking under the
        "security level" utility.

        You can also try doing "ssh -vvv [EMAIL PROTECTED] " and see if
        it gives you any clues.

        On 10/25/07, *Robert Wilson Ashcraft* <[EMAIL PROTECTED]> wrote:

            Hi,

            I am attempting to set up an OSCAR cluster.  I have
            gotten through everything past step 7, Complete Cluster
            Setup (which finished successfully).

            However, when I run the cluster tests, I get several
            failures, most noticeably with the node-to-server
            communication.

            This is also confirmed by the fact that I can SSH to a
            node, but when I am logged into the node, I cannot SSH
            back into the server (it just hangs... no error message,
            but I can ctrl-C out of it).

            Do you have any idea why the SSH from the client to the
            server would not be working?

            I have a feeling that if this problem is solved, the
            other failed tests will work themselves out.

            I am attaching the oscarinstall.log file in case that helps.

            Here is my /etc/hosts file if that helps:
            # Do not remove the following line, or various programs
            # that require network functionality will fail.
            127.0.0.1       localhost.localdomain localhost
            192.168.0.1     pharos.mit.edu pharos oscar_server nfs_oscar pbs_oscar
            18.80.7.242     pharos.mit.edu pharos

            # These entries are managed by SIS, please don't modify them.
            192.168.0.100   oscarnode01.mit.edu     oscarnode01


            Thanks for your help.

            Rob Ashcraft


            
-------------------------------------------------------------------------

            This SF.net email is sponsored by: Splunk Inc.
            Still grepping through log files to find problems?  Stop.
            Now Search log events and configuration files using AJAX
            and a browser.
            Download your FREE copy of Splunk now >>
            http://get.splunk.com/
            _______________________________________________
            Oscar-users mailing list
            Oscar-users@lists.sourceforge.net
            <mailto:Oscar-users@lists.sourceforge.net>
            https://lists.sourceforge.net/lists/listinfo/oscar-users






--
    Robert W. Ashcraft
    Ph.D. Candidate
    Dept. Chemical Engineering
    Massachusetts Institute of Technology
    77 Massachusetts Ave.
    Room 66-264
    Cambridge, MA 02139
    Phone: 617-253-6554
    E-mail: [EMAIL PROTECTED]




