Cluster: 5 dual-Opteron machines, Tyan S2885, 3ware SATA RAID
Switch: 3Com gigabit
OS: Fedora Core 3 x86_64, fresh install
OSCAR 4.2.1b54423
-------------
29 March 2006
OSCAR 4.2.1b54423
------------------------------ INSTALLATION REPORT ------------------------------
My OSCAR cluster installation is finally up and running, but I hit several
errors along the way. They are reported here together with my workarounds.
Summary:
1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3
specific?)
2 - STEP 3 (OSCAR server packages) eth0 not up (not OSCAR related)
3 - STEP 4 (Build OSCAR client image) Forced packages for i686/i386 failed
(some dependencies are UNKNOWN)
4 - STEP 6 (CLIENT Installation) rsync hangs (pfilter must be off before
network deployment)
5 - STEP 8 (TEST CLUSTER SETUP) Ganglia test failed (cause: multicast is the
default mode in gmond.conf; it should be unicast on my cluster and switch)
The OSCAR log output and my workarounds follow.
Many thanks, and enjoy OSCAR!
./install_cluster eth0
...
---------------------------------------------------------------------------------------
1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3
specific?)
---------------------------------------------------------------------------------------
--> Returning oscar_server packages for apitest: elementtree Twisted
apitest-profiled apitest
--> Returning oscar_server packages for perl-Qt: perl-Qt
--> Returning oscar_server packages for sync_files: crontabs sync_files
--> Installing server core RPMs
warning: /tftpboot/rpm/apr-util-0.9.4-17.x86_64.rpm: V3 DSA signature:
NOKEY, key ID 4f2a6fd2
error: Failed dependencies:
libapr-0.so.0()(64bit) is needed by apr-util-0.9.4-17.x86_64
libapr-0.so.0()(64bit) is needed by httpd-2.0.52-3.x86_64
$pm->install ($dm->query_required_by ()) failed at ./wizard_prep line 312
Couldn't install packages needed for OSCAR Wizard to run at ./wizard_prep
line 312
Oscar Wizard preparation script failed to complete at ./install_cluster
line 212.
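WORKAROUND :
libapr-0.so.0 is provided by the apr package, which was not installed yet.
I installed it by hand in a shell and then reran the installer:
rpm -U /home/tftpboot/rpm/apr-0.9.4-23.x86_64.rpm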
./install_cluster eth0
...
---------------------------------------------------------------------------------------
1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3
specific?), second failure
---------------------------------------------------------------------------------------
...
--> Returning oscar_server packages for lam: lam-switcher-modulefile
lam-oscar-modulefile lam-oscar libaio-devel libaio
--> Installing server non-core RPMs (core RPMs already installed)
warning: /tftpboot/rpm/libaio-devel-0.3.102-1.x86_64.rpm: V3 DSA signature:
NOKEY, key ID 4f2a6fd2
error: Failed dependencies:
libaio.so.1.0.0()(64bit) is needed by lam-oscar-7.0.6-3.x86_64
$pm->install ($dm->query_required_by ()) failed at ./install_server line 73
Couldn't install the required packages needed for OSCAR at ./install_server
line 73
--> Step 3: Failed to properly install OSCAR server; please check the logs
WORKAROUND :
Again I installed the missing package by hand in a shell:
rpm -U /home/tftpboot/rpm/libaio-0.3.102-1.x86_64.rpm
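Both dependency failures above have the same shape: rpm names a missing library and the package that needs it. The following is only a minimal sketch of a helper that pulls the distinct missing capabilities out of a saved installer log. The function name missing_caps is made up for illustration and is not part of OSCAR.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSCAR): list the distinct capabilities
# that rpm reported as missing in a saved installer log.
# It reads lines such as
#   libapr-0.so.0()(64bit) is needed by apr-util-0.9.4-17.x86_64
# and prints only the part before " is needed by ", once per capability.
missing_caps() {
    sed -n 's/^[[:space:]]*\(.*\) is needed by .*$/\1/p' "$1" | sort -u
}
```

Once the missing library is known, "rpm -qp --provides <file>.rpm" lists what a given package file provides, which helps to find the RPM to install by hand.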
---------------------------------------------------------------------------------------
2 - STEP 3 (OSCAR server packages) eth0 not up (not OSCAR related)
---------------------------------------------------------------------------------------
...
Ganglia page is located at http://localhost/ganglia
--> Successfully ran server non-core package post_server_install scripts
--> Getting internal IP address
Cannot update hosts without a valid ip.
at ./install_server line 194
main::update_hosts('undef') called at ./install_server line 154
--> Got: [IP ]
--> Got: [broadcast ]
--> Got: [netmask ]
--> Adding hosts to /etc/hosts
--> Step 3: Failed to properly install OSCAR server; please check the logs
--> Update Wizard Env (as needed)
Update environment: ENV{MANPATH}
Update environment: ENV{PVM_RSH}
Update environment: ENV{PVM_ROOT}
Update environment: ENV{PVM_ARCH}
Update environment: ENV{PATH}
Update environment: ENV{_LMFILES_}
Update environment: ENV{LOADEDMODULES}
WORKAROUND :
eth0 was not up and its ifcfg-eth0 file was wrong.
I edited ifcfg-eth0 by hand:
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
TYPE=Ethernet
IPADDR=192.168.150.50
GATEWAY=192.168.150.253
BROADCAST=192.168.150.255
NETMASK=255.255.255.0
NETWORK=192.168.150.0
HWADDR=00:0e:0c:60:84:4f
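Since install_server aborts when it cannot get an IP address for the interface, it may be worth checking the file before rerunning the installer. This is a minimal sketch only: the function name check_ifcfg and the choice of keys to test are assumptions made for illustration, based on the settings shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OSCAR): verify that an ifcfg file
# defines a static address before rerunning the installer.
# Prints each missing or empty key and returns non-zero if any is absent.
check_ifcfg() {
    file=$1
    rc=0
    for key in DEVICE BOOTPROTO ONBOOT IPADDR NETMASK; do
        # The key must be present and followed by at least one character.
        if ! grep -q "^${key}=." "$file"; then
            echo "missing or empty: $key"
            rc=1
        fi
    done
    return $rc
}
```

On Fedora the file normally lives in /etc/sysconfig/network-scripts, so a possible use is "check_ifcfg /etc/sysconfig/network-scripts/ifcfg-eth0 && ./install_cluster eth0".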
---------------------------------------------------------------------------------------
3 - STEP 4 (Build OSCAR client image) Forced packages for i686/i386 failed
(some dependencies are UNKNOWN)
---------------------------------------------------------------------------------------
...
<== OK gpm /tftpboot/rpm/gpm-1.20.1-66.x86_64.rpm -ihv --nodeps
<== OK authconfig /tftpboot/rpm/authconfig-4.6.5-3.1.x86_64.rpm -ihv --nodeps
<== OK glibc-headers /tftpboot/rpm/glibc-headers-2.3.3-74.x86_64.rpm -ihv --nodeps
<== OK sysklogd /tftpboot/rpm/sysklogd-1.4.1-22.x86_64.rpm -ihv --nodeps
<== OK torque-mom /tftpboot/rpm/torque-mom-1.2.0p5-2.x86_64.rpm -ihv --nodeps
<== OK module-init-tools /tftpboot/rpm/module-init-tools-3.1-0.pre5.3.x86_64.rpm -ihv --nodeps
<== OK cpio /tftpboot/rpm/cpio-2.5-7.x86_64.rpm -ihv --nodeps
<== t=3s
<== $? 0
4: Forced packages for i686: glibc
==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch'
'i686' 'glibc'
<== NG glibc /tftpboot/rpm/glibc-2.3.3-74.i686.rpm requires basesystem
(UNKNOWN) requires glibc-common = 2.3.3-74 (UNKNOWN) requires
libgcc (UNKNOWN)
<== t=1s
<== $? 1
5: Forced packages for i386: tcl libstdc++ freetype fontconfig glibc-devel
xorg-x11-libs expat ncurses xorg-x11-Mesa-libGL zlib gpm libgcc
==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch'
'i386' 'tcl' 'libstdc++' 'freetype' 'fontconfig' 'glibc-devel'
'xorg-x11-libs' 'expat' 'ncurses' 'xorg-x11-Mesa-libGL' 'zlib' 'gpm' 'libgcc'
<== OK libgcc /tftpboot/rpm/libgcc-3.4.2-6.fc3.i386.rpm -ihv
<== NG expat /tftpboot/rpm/expat-1.95.7-4.i386.rpm requires
libc.so.6(GLIBC_2.1) (UNKNOWN) requires libc.so.6(GLIBC_2.0) (UNKNOWN)
WORKAROUND :
I enabled debug mode with "export DEBUG_UPDATE_RPMS=1".
Strangely, the problem went away once after I ran "start_over" and then
"install_cluster eth0" from a fresh login, but I could not reproduce this.
After that I always had trouble with the UNKNOWN basesystem dependency, even
though the basesystem package is present in tftpboot/rpm.
So, before step 4 "Build OSCAR Client Image", I edited
/usr/lib/systeminstaller/SystemInstaller/Package/UpdateRPMs.pm
and added the following after line 99:
for my $farch (keys %{$forced}) {
$cmd = "update-rpms --root=none --cache=u --list ";
The --check option is the default and is what is used for the x86_64 RPMs.
For the forced i386 and i686 packages I changed it to --list.
---------------------------------------------------------------------------------------
4 - STEP 6 (CLIENT Installation) rsync hangs (pfilter must be off before
network deployment)
---------------------------------------------------------------------------------------
rsync hung during the client installation, so I checked
/var/log/systemimager/rsyncd:
[EMAIL PROTECTED] ~]# cat /var/log/systemimager/rsyncd
2006/03/28 04:21:09 [9409] rsyncd version 2.6.3 starting, listening on port 873
2006/03/28 04:36:24 [12692] rsync on boot/x86_64/standard/boel_binaries.tar.gz from node1.cluster.ird.nc (192.168.150.10)
2006/03/27 17:36:24 [12692] wrote 5003073 bytes read 122 bytes total size 5002338
2006/03/28 04:36:25 [12693] rsync on scripts/ from node1.cluster.ird.nc (192.168.150.10)
2006/03/27 17:36:25 [12693] wrote 22186 bytes read 191 bytes total size 21567
2006/03/28 04:37:12 [12698] rsync on editr-bunch1 from node1.cluster.ird.nc (192.168.150.10)
2006/03/28 04:37:13 [12698] wrote 1057372 bytes read 81 bytes total size 899587892
2006/03/28 04:37:13 [12700] rsync on editr-bunch1/ from node1.cluster.ird.nc (192.168.150.10)
2006/03/28 04:47:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:48:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:49:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:50:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:51:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:52:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:52:43 [12700] rsync: writefd_unbuffered failed to write 69 bytes: phase "unknown" [sender]: Connection timed out (110)
2006/03/28 04:52:43 [12700] rsync error: error in rsync protocol data stream (code 12) at io.c(909)
WORKAROUND :
After some searching in the oscar-users archives, I stopped the pfilter
service before deploying the cluster, because the firewall interferes with
the rsync transfers.
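The symptom in the rsyncd log is a "timeout in data send/receive" error repeated by the same process. A minimal sketch of a way to count those errors per rsyncd process, so that a stalled transfer stands out from a one-off error. The function name rsync_timeouts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count "timeout in data send/receive" errors per
# rsyncd process ID in a SystemImager rsyncd log.
# Prints "<pid> <count>" for every process that timed out at least once.
rsync_timeouts() {
    awk '/rsync error: timeout in data send\/receive/ {
             pid = $3
             gsub(/\[|\]/, "", pid)   # strip the brackets around the PID
             n[pid]++
         }
         END { for (p in n) print p, n[p] }' "$1"
}
```

A possible use is "rsync_timeouts /var/log/systemimager/rsyncd". Assuming pfilter is installed as an init script named pfilter, it can be stopped with "service pfilter stop" before the clients are deployed.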
---------------------------------------------------------------------------------------
5 - STEP 8 (TEST CLUSTER SETUP) Ganglia test failed (cause: multicast is the
default mode in gmond.conf)
---------------------------------------------------------------------------------------
The Ganglia setup test failed, so I checked ganglia.err:
[EMAIL PROTECTED] ganglia]# cat ganglia.err
Client nodes: node1.cluster.ird.nc node2.cluster.ird.nc
node3.cluster.ird.nc node4.cluster.ird.nc
Match pattern:
editr.cluster.ird.nc|node1.cluster.ird.nc|node2.cluster.ird.nc|node3.cluster.ird.nc|node4.cluster.ird.nc
Number of hosts matched: 1
Gstat output:
CLUSTER INFORMATION
Name: EDITR Cluster
Hosts: 1
Gexec Hosts: 0
Dead Hosts: 0
Localtime: Tue Mar 28 19:54:16 2006
CLUSTER HOSTS
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total)        [     1,     5, 15min]  [  User,  Nice, System, Idle]
editr.cluster.ird.nc
    2 (    0/  175)        [  0.10,  0.07,  0.02]  [   2.0,   0.0,   0.9,  96.9] OFF
The number of nodes expected is different from the number of nodes detected.
Check to see if gmond is running on all your nodes and make sure that you
are not having any network issues.
WORKAROUND :
After some searching on the ganglia-general list, I commented out all the
multicast entries in gmond.conf on my compute nodes and on the master node,
and added the master node IP instead, like this:
udp_send_channel {
# mcast_join = 239.2.11.71
host = 192.168.150.50
port = 8649
}
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8649
# bind = 239.2.11.71
}
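With several nodes to edit, the commenting-out step can be scripted. A minimal sketch, assuming every mcast_join and bind line in the file is a multicast setting, as in the example above. The function name comment_mcast is made up for illustration, and the unicast "host = ..." line still has to be added by hand.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a gmond.conf with the multicast
# settings (mcast_join and bind lines) commented out.
# Lines that are already commented are left unchanged.
# It does not add the unicast "host = ..." line.
comment_mcast() {
    sed -e 's/^\([[:space:]]*\)\(mcast_join[[:space:]]*=\)/\1# \2/' \
        -e 's/^\([[:space:]]*\)\(bind[[:space:]]*=\)/\1# \2/' "$1"
}
```

The output goes to standard output, so the original file stays untouched until the result has been reviewed and copied into place.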
------------------------------------------------------------------------------------------------
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users