Hi Brandi Winfrey,
I am not sure whether this will solve your problem, but I would like you to try it.
Please run "post_install" in oscar/packages/opium/scripts/
It will synchronize all your nodes.
Thanks,
DongInn.
Brandi Winfrey wrote:
I found an email in the archives describing a similar problem, but could not find an
answer to it. I had a cluster of 9 computers that worked fine. I just
added 6 more nodes to the cluster. All nodes were added successfully, the
networking went fine, even the "Complete the Cluster Setup" section passed.
When I started the "Test Cluster Setup", it first had issues with SSH
that had to do with passwords and "man-in-the-middle" attacks. I know this
is because I used ssh from one node to another before I should have, and the
keys didn't match. I wasn't sure exactly how to fix this, so I just deleted the
known_hosts file in the .ssh directory and ran /opt/opium/bin/sync_users --force
to try to reset all of the passwords. I was also being prompted for a
password in the Test Cluster Setup even though I hadn't set one.
After doing this, I can now pass the SSH pingtest, the SSH server->node, and
the SSH node->server tests. I can't get any further than this: the next test
fails with an error that there aren't enough free nodes.
I quit the test and tried a few things...
I CAN ssh to and from all of the nodes, but I get the following warning:
"Warning: No xauth data; using fake authentication data for X11 forwarding."
My /etc/hosts file looks fine:
/etc/hosts:---------------------------------------------------------------------------------
# Do not remove the following line, or various programs
# that require network functionality will fail.
10.0.0.100 oscar oscar.oscardomain oscar_server nfs_oscar pbs_oscar
127.0.0.1 localhost localhost.oscardomain localhost
129.162.79.58 oscar oscar.geophysics.swri.edu
# These entries are managed by SIS, please don't modify them.
10.0.0.1 oscarnode1.oscardomain oscarnode1
10.0.0.10 oscarnode10.oscardomain oscarnode10
10.0.0.11 oscarnode11.oscardomain oscarnode11
10.0.0.12 oscarnode12.oscardomain oscarnode12
10.0.0.13 oscarnode13.oscardomain oscarnode13
10.0.0.14 oscarnode14.oscardomain oscarnode14
10.0.0.2 oscarnode2.oscardomain oscarnode2
10.0.0.3 oscarnode3.oscardomain oscarnode3
10.0.0.4 oscarnode4.oscardomain oscarnode4
10.0.0.5 oscarnode5.oscardomain oscarnode5
10.0.0.6 oscarnode6.oscardomain oscarnode6
10.0.0.7 oscarnode7.oscardomain oscarnode7
10.0.0.8 oscarnode8.oscardomain oscarnode8
10.0.0.9 oscarnode9.oscardomain oscarnode9
-------------------------------------------------------------------------------------------------
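A hosts file like the one above can also be checked mechanically rather than by eye. The following is a minimal Python sketch, not part of OSCAR or SIS; parse_hosts, ambiguous_names and missing_names are hypothetical helper names. It assumes the usual one-entry-per-line layout (address first, then names, with "#" starting a comment) and only reports what the file says; it does not decide whether an entry is wrong.

```python
def parse_hosts(text):
    """Map every hostname or alias to the list of addresses it is listed under."""
    names = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, aliases = fields[0], fields[1:]
        for alias in aliases:
            addresses = names.setdefault(alias, [])
            if address not in addresses:
                addresses.append(address)
    return names


def ambiguous_names(names):
    """Return the names that are listed under more than one address."""
    return {name: addrs for name, addrs in names.items() if len(addrs) > 1}


def missing_names(names, expected):
    """Return the expected hostnames that do not appear in the file at all."""
    return [name for name in expected if name not in names]
```

For example, parse_hosts(open("/etc/hosts").read()) could be run on the master and on each new node, and the results compared. In the listing above, the name "oscar" appears under both 10.0.0.100 and 129.162.79.58, which ambiguous_names would report; whether that matters here is not established.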
If I run pbsnodes -a, all of the nodes show up with the following (substitute the correct
node number where the 13 is):
oscarnode13.oscardomain
     state = job-exclusive
     np = 1
     properties = all
     ntype = cluster
     jobs = 0/2.oscar
Oh, the job-exclusive state and the 0/2.oscar job entry above are probably because I have
a run currently executing on nodes 1-8, which still work correctly. The only
nodes that I can't get to join the cluster are the new nodes 9-14.
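Since the test complains that there aren't enough free nodes, it may help to group the pbsnodes -a records by state and see how many are actually "free". The following is a minimal Python sketch, not a PBS tool; parse_pbsnodes and nodes_by_state are hypothetical helper names, and it assumes the record layout shown above (a node name on its own line, followed by "key = value" attribute lines).

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            # Attribute line; it belongs to the most recent node name seen.
            if current is not None:
                key, value = line.split(" = ", 1)
                nodes[current][key.strip()] = value.strip()
        else:
            # Any other non-empty line is treated as a node name (assumption).
            current = line
            nodes[current] = {}
    return nodes


def nodes_by_state(nodes):
    """Group node names by their reported state."""
    grouped = {}
    for name, attrs in nodes.items():
        grouped.setdefault(attrs.get("state", "unknown"), []).append(name)
    return grouped
```

The output of pbsnodes -a could be saved to a file and passed to parse_pbsnodes; nodes_by_state then shows which nodes are listed as job-exclusive, free, or in any other state.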
If I execute ifconfig -a, all of the ethernet cards are UP.
The problem seems to be with MPI. When I run MPI I get the following error
(only with the new nodes):
rm_905: p4_error: Could not gethostbyname for host oscarnode9; may be invalid name
: 61
bm_list_1264: (81.330047) wakeup_slave: unable to interrupt slave 0 pid 1263
bm_list_1264: (81.330446) wakeup_slave: unable to interrupt slave 0 pid 1263
bm_list_1264: (81.330694) wakeup_slave: unable to interrupt slave 0 pid 1263
bm_list_1264: (81.330941) wakeup_slave: unable to interrupt slave 0 pid 1263
bm_list_1264: (81.331366) wakeup_slave: unable to interrupt slave 0 pid 1263
bm_list_1264: (81.331619) wakeup_slave: unable to interrupt slave 0 pid 1263
p9_762: (68.906620) net_recv failed for fd = 3
p9_762: p4_error: net_recv read, errno = : 104
p13_691: (61.793863) net_recv failed for fd = 3
p13_691: p4_error: net_recv read, errno = : 104
p14_691: (60.231699) net_recv failed for fd = 3
p14_691: p4_error: net_recv read, errno = : 104
p11_691: (65.034232) net_recv failed for fd = 3
p11_691: p4_error: net_recv read, errno = : 104
p12_691: (63.368020) net_recv failed for fd = 3
p12_691: p4_error: net_recv read, errno = : 104
p10_694: (66.745358) net_recv failed for fd = 3
p10_694: p4_error: net_recv read, errno = : 104
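The first error above says that a lookup of the name oscarnode9 failed. One way to narrow this down is to try resolving every node name on the machine where the error is reported. The following is a minimal Python sketch, not part of MPICH or OSCAR; resolve_hosts and node_names are hypothetical helper names, and the "oscarnode" prefix and the 1-14 numbering are taken from the /etc/hosts listing in this message.

```python
import socket


def node_names(first, last, prefix="oscarnode"):
    """Build names such as oscarnode9 .. oscarnode14 (naming is an assumption)."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


def resolve_hosts(hostnames, resolver=socket.gethostbyname):
    """Return {hostname: address}, with None for names that fail to resolve."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            results[name] = None
    return results


def unresolved(results):
    """List the hostnames that could not be resolved."""
    return [name for name, address in results.items() if address is None]
```

Running unresolved(resolve_hosts(node_names(1, 14))) on the master and then on one of the new nodes would show whether the short names resolve everywhere, or only on some machines.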
I looked at the file <eth_module>.o in /lib/modules/<kernel-version>/kernel/drivers/net
on the master and on the nodes. They are not the same, but when I look at this file
for the first 8 nodes that are working correctly, they are also not the same. Despite this,
I took some advice from the archives and copied the master's <eth_module>.o file to
the nodes (only the 6 new nodes) and rebooted the nodes. This seems to have done
nothing. Since I saved the original files, I'll probably just put it back the way it was.
I looked at the /etc/exports and /etc/fstab files. I didn't see anything wrong there.
/etc/fstab (nodes):-------------------------------------------------------------------
/dev/hda6 / ext2 defaults 1 2
/dev/hda5 swap swap defaults 0 0
/dev/hda1 /boot ext2 defaults 1 2
/dev/fd0 /mnt/floppy auto noauto,owner 0 0
none /dev/pts devpts defaults 0 0
none /proc proc defaults 0 0
nfs_oscar:/home /home nfs rw 0 0
/etc/fstab (master):---------------------------------------------------------------------
LABEL=/ / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
none /dev/pts devpts gid=5,mode=620 0 0
none /proc proc defaults 0 0
none /dev/shm tmpfs defaults 0 0
/dev/hda3 swap swap defaults 0 0
/dev/cdrom /mnt/cdrom udf,iso9660 noauto,owner,kudzu,ro 0 0
/dev/hdd4 /mnt/zip auto noauto,owner,kudzu 0 0
/dev/fd0 /mnt/floppy auto noauto,owner,kudzu 0 0
/etc/exports (master -- the nodes don't have one):----------------
/home 10.0.0.100/255.255.255.0(async,rw,no_root_squash)
---------------------------------------------------------------------------------------
Do you have any suggestions on how to fix this?
Thank you, Brandi
------------------------------------------------------- This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170 Project Admins to receive an Apple iPod Mini FREE for your judgement on who ports your project to Linux PPC the best. Sponsored by IBM. Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php _______________________________________________ Oscar-users mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/oscar-users
