Hi!

I am not blaming OSCAR in any way. I just don't know where this error could
be originating from and I am trying to eliminate the options.

I have a small cluster (OSCAR 5.1rc18047, Fedora 8) and I was able to run
some application software on it. Then lightning struck very close to the
building. Fortunately I had unplugged all the power cables (because the
cluster has not yet been moved to where the power lines are protected), but
it seems that the institution didn't have any protection on their intranet
cables, and so the whole building's public network cards are damaged. A
costly lesson.

Anyway, when I try to run the application software in parallel across the
cluster (using the private network, which is unscathed) I get the following
error message:
*bind: Cannot assign requested address.*
I contacted the application software's help department as I thought I had
perhaps forgotten to set something, but according to them it is a normal
network problem. They gave some suggestions as to what the problem may be,
but I have checked them all and none of them cures the problem.

I have included their suggestions here so that you don't waste time by
suggesting the same things.
 Quote:
  Check the /etc/hosts file and make sure that the nodes all have a
single definition and you don't have lines like

127.0.0.1 localhost normnode3

and that normnode3 has the same address both on the master and on the
node.

You can try

ping normnode3

from the master and see what address comes back

64 bytes from 164.190.57.105: icmp_seq=1 ttl=64 time=0.306 ms

or is it 127.0.0.1. Then do the reverse.

Also double check that you can ssh between nodes without password
but I would expect a different error then.
The command "hostname" returns gnlserv01, which is the public NIC.
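
One way to test this directly is to resolve the name that "hostname"
reports and try to bind a socket to the resulting address. The following is
a minimal Python 3 sketch, not something taken from OSCAR or from the
application software; the function names try_bind and check_hostname are
illustrative, and it assumes Python 3 is installed on the machine being
checked. If the name resolves to an address that no interface currently
holds, bind fails with EADDRNOTAVAIL, which Linux reports as "Cannot assign
requested address".

```python
import errno
import socket


def try_bind(address, port=0):
    """Bind a TCP socket to (address, port); return None on success, else the OSError."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((address, port))  # port 0 lets the kernel pick a free port
        except OSError as exc:
            return exc
    return None


def check_hostname(name=None):
    """Resolve a host name (default: this machine's own) and report whether bind works."""
    name = name or socket.gethostname()
    try:
        address = socket.gethostbyname(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{name} does not resolve: {exc}"
    error = try_bind(address)
    if error is None:
        return f"{name} -> {address}: bind succeeded"
    if error.errno == errno.EADDRNOTAVAIL:
        return f"{name} -> {address}: no local interface holds this address"
    return f"{name} -> {address}: bind failed: {error}"
```

Calling print(check_hostname()) with no argument on the master and on each
node shows which address each machine's own name maps to and whether that
address can be bound locally. Passing another machine's name is not
meaningful here, since binding to a remote address is expected to fail.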

After the lightning I had trouble getting the nodes to communicate
"automatically" with each other, but it can be cured by starting the xinetd
service (so that the nodes can boot across the network, otherwise I get tftp
errors) and disabling the firewall on the master node (it's not too
dangerous since I don't have a public interface at present and since I'm
sitting behind the institution's firewall as well.)

Is there a service that I need to start, or some port that I need to open?
Ganglia also doesn't work (doesn't show any stats) after the lightning. But
I guess that's because it was configured as https://gnlserv01/ganglia and the
public NIC is dead?


Here is what my /etc/hosts file looks like:
Code:

# Do not remove the following line, or various programs
# that require network functionality will fail.

# These entries are managed by SIS, please don't modify them.
127.0.0.1       localhost.localdomain   localhost
192.168.1.254   snode0.oscardomain.za   snode0  oscar_server nfs_oscar pbs_oscar
abc.xyz.104.218 gnlserv01.ab.cx.yz      gnlserv01
192.168.1.1     normnode1.ab.cx.yz      normnode1
192.168.1.2     normnode2.ab.cx.yz      normnode2
192.168.1.3     normnode3.ab.cx.yz      normnode3
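
The first suggestion in the quote (each node has a single definition, and no
node name shares the 127.0.0.1 line) can be checked mechanically. This is a
minimal Python 3 sketch under that reading of the advice; parse_hosts and
suspicious_names are hypothetical helper names, and treating names that start
with "localhost" as legitimate loopback entries is an assumption.

```python
from collections import defaultdict


def parse_hosts(text):
    """Map each host name or alias to the set of addresses it is listed under."""
    names = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, comment, or an address with no names
        address, *aliases = fields
        for alias in aliases:
            names[alias].add(address)
    return names


def suspicious_names(text):
    """Return names listed under several addresses, or node names on loopback."""
    problems = {}
    for name, addresses in parse_hosts(text).items():
        on_loopback = any(a.startswith("127.") for a in addresses)
        if len(addresses) > 1 or (on_loopback and not name.startswith("localhost")):
            problems[name] = sorted(addresses)
    return problems
```

Reading /etc/hosts into a string on the master and on each node and passing
it to suspicious_names should return an empty dict when the file follows the
advice. A line such as "127.0.0.1 localhost normnode3" next to a regular
normnode3 entry would be reported with both addresses.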

Here is the output of ifconfig -a:
 Code:

[compc...@gnlserv01 /root]$ ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:1C:C0:AF:10:18
          inet addr:192.168.1.254  Bcast:192.168.255.255  Mask:255.255.0.0
          inet6 addr: fe80::21c:c0ff:feaf:1018/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2587 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3109 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:332943 (325.1 KiB)  TX bytes:409521 (399.9 KiB)
          Base address:0x20c0 Memory:e0300000-e0320000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:5009 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5009 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3813184 (3.6 MiB)  TX bytes:3813184 (3.6 MiB)
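
As a consistency check on the private side, the eth0 values above can be
reproduced with Python's standard ipaddress module. This is only a sketch
using the addresses shown in this post; it confirms that the netmask and the
broadcast address agree and that the three node addresses from /etc/hosts
fall inside the same network.

```python
import ipaddress

# eth0 on the master, as reported by ifconfig: address plus netmask.
master = ipaddress.ip_interface("192.168.1.254/255.255.0.0")
network = master.network

# Node addresses copied from the /etc/hosts listing in this post.
nodes = {
    "normnode1": "192.168.1.1",
    "normnode2": "192.168.1.2",
    "normnode3": "192.168.1.3",
}

print("network:  ", network)
print("broadcast:", network.broadcast_address)
for name, address in nodes.items():
    inside = ipaddress.ip_address(address) in network
    print(f"{name} {address} inside {network}: {inside}")
```

The computed network is 192.168.0.0/16 with broadcast 192.168.255.255, which
matches the Bcast value in the ifconfig output, so the private interface
itself looks consistently configured.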

I'm really clueless. I'm a chemist and I got this cluster to run somehow,
but it wasn't because I knew what I was doing.
I would greatly appreciate any suggestions and comments!

;-)

Rion
-- 
"For the Lord will not cast off forever, but, though He cause grief, He will
have compassion according to the abundance of His steadfast love; for He
does not willingly afflict or grieve the children of men." - Lamentations
3:31-33
_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users
