From: Lefevre Jerome [mailto:[EMAIL PROTECTED]
Sent: Tue 28/03/2006 17:25
To: Bernard Li; [email protected]
Subject: 4.2.1b54423 : REPORT on Fedora Core 3 X86_64 and Ganglia is ON
Cluster: 5 x dual-Opteron Tyan S2885, 3ware SATA RAID
Switch: 3Com gigabit
OS: Fedora Core 3 x86_64, fresh install
OSCAR: 4.2.1b54423
-------------
29 March 2006
OSCAR 4.2.1b54423
------------------------------------------
INSTALLATION REPORT
------------------------------------------------
My OSCAR cluster install is finally up, but I hit several errors along the way. They are listed here with my workarounds.

Summary:
1 - STEP 3 (OSCAR server packages): Failed dependencies (Fedora Core 3 dependent?)
2 - STEP 3 (OSCAR server packages): eth0 not up (not OSCAR dependent)
3 - STEP 4 (Build OSCAR client image): Forced packages for i686/i386 failed (some dependencies are UNKNOWN)
4 - STEP 6 (Client installation): rsync hangs (pfilter must be off before network deployment)
5 - STEP 8 (Test cluster setup): Ganglia test failed (cause: gmond.conf defaults to multicast; it should be unicast for my cluster and switch)

The OSCAR log output and my workarounds follow.
Many thanks, and enjoy OSCAR!
./install_cluster eth0
...
---------------------------------------------------------------------------------------
1 - STEP 3 (OSCAR server packages): Failed dependencies (Fedora Core 3 dependent?)
---------------------------------------------------------------------------------------
--> Returning oscar_server packages for apitest: elementtree Twisted apitest-profiled apitest
--> Returning oscar_server packages for perl-Qt: perl-Qt
--> Returning oscar_server packages for sync_files: crontabs sync_files
--> Installing server core RPMs
warning: /tftpboot/rpm/apr-util-0.9.4-17.x86_64.rpm: V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: Failed dependencies:
        libapr-0.so.0()(64bit) is needed by apr-util-0.9.4-17.x86_64
        libapr-0.so.0()(64bit) is needed by httpd-2.0.52-3.x86_64
$pm->install ($dm->query_required_by ()) failed at ./wizard_prep line 312
Couldn't install packages needed for OSCAR Wizard to run at ./wizard_prep line 312
Oscar Wizard preparation script failed to complete at ./install_cluster line 212.
WORKAROUND:
In a shell, I installed the missing apr package by hand:
rpm -U /home/tftpboot/rpm/apr-0.9.4-23.x86_64.rpm

Then I ran the installer again:
./install_cluster eth0
...
---------------------------------------------------------------------------------------
1 - STEP 3 (OSCAR server packages): Failed dependencies (Fedora Core 3 dependent?), second failure
---------------------------------------------------------------------------------------
...
--> Returning oscar_server packages for lam: lam-switcher-modulefile lam-oscar-modulefile lam-oscar libaio-devel libaio
--> Installing server non-core RPMs (core RPMs already installed)
warning: /tftpboot/rpm/libaio-devel-0.3.102-1.x86_64.rpm: V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: Failed dependencies:
        libaio.so.1.0.0()(64bit) is needed by lam-oscar-7.0.6-3.x86_64
$pm->install ($dm->query_required_by ()) failed at ./install_server line 73
Couldn't install the required packages needed for OSCAR at ./install_server line 73
--> Step 3: Failed to properly install OSCAR server; please check the logs
WORKAROUND:
In a shell, I installed the missing libaio package by hand:
rpm -U /home/tftpboot/rpm/libaio-0.3.102-1.x86_64.rpm
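
Both failures above were fixed the same way: find the RPM that provides the missing library in the local package pool and install it before rerunning install_cluster. A minimal sketch of that step follows. The function names find_pool_rpm and ensure_installed are hypothetical helpers made up for illustration, not part of OSCAR, and the pool directory is passed in rather than assumed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OSCAR) sketching the manual workaround.

# find_pool_rpm POOL NAME ARCH
# Print the RPM files in POOL named NAME-<version>.ARCH.rpm.
# The "-[0-9]" after the name keeps "apr" from matching "apr-util".
find_pool_rpm() {
    local pool="$1" name="$2" arch="$3" f
    for f in "$pool"/"$name"-[0-9]*."$arch".rpm; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# ensure_installed POOL NAME ARCH
# Install NAME from POOL with "rpm -U" unless "rpm -q" says it is present.
ensure_installed() {
    local pool="$1" name="$2" arch="$3" file
    if rpm -q "$name" >/dev/null 2>&1; then
        return 0
    fi
    file=$(find_pool_rpm "$pool" "$name" "$arch" | head -n 1)
    if [ -z "$file" ]; then
        echo "no $name RPM for $arch found in $pool" >&2
        return 1
    fi
    rpm -U "$file"
}
```

For this report, the equivalent calls would be "ensure_installed /home/tftpboot/rpm apr x86_64" and "ensure_installed /home/tftpboot/rpm libaio x86_64", run as root before "./install_cluster eth0".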
---------------------------------------------------------------------------------------
2 - STEP 3 (OSCAR server packages): eth0 not up (not OSCAR dependent)
---------------------------------------------------------------------------------------
...
Ganglia page is located at http://localhost/ganglia
--> Successfully ran server non-core package post_server_install scripts
--> Getting internal IP address
Cannot update hosts without a valid ip.
 at ./install_server line 194
        main::update_hosts('undef') called at ./install_server line 154
--> Got: [IP ]
--> Got: [broadcast ]
--> Got: [netmask ]
--> Adding hosts to /etc/hosts
--> Step 3: Failed to properly install OSCAR server; please check the logs
--> Update Wizard Env (as needed)
Update environment: ENV{MANPATH}
Update environment: ENV{PVM_RSH}
Update environment: ENV{PVM_ROOT}
Update environment: ENV{PVM_ARCH}
Update environment: ENV{PATH}
Update environment: ENV{_LMFILES_}
Update environment: ENV{LOADEDMODULES}
WORKAROUND:
eth0 was not up and its ifcfg-eth0 file was wrong, so the installer could not read an IP address, broadcast or netmask from the interface.
I edited ifcfg-eth0 by hand:
DEVICE=eth0
BOOTPROTO=static
TYPE=Ethernet
IPADDR=192.168.150.50
GATEWAY=192.168.150.253
BROADCAST=192.168.150.255
NETMASK=255.255.255.0
NETWORK=192.168.150.0
HWADDR=00:0e:0c:60:84:4f
---------------------------------------------------------------------------------------
3 - STEP 4 (Build OSCAR client image): Forced packages for i686/i386 failed (some dependencies are UNKNOWN)
---------------------------------------------------------------------------------------
...
<== OK gpm /tftpboot/rpm/gpm-1.20.1-66.x86_64.rpm -ihv --nodeps
<== OK authconfig /tftpboot/rpm/authconfig-4.6.5-3.1.x86_64.rpm -ihv --nodeps
<== OK glibc-headers /tftpboot/rpm/glibc-headers-2.3.3-74.x86_64.rpm -ihv --nodeps
<== OK sysklogd /tftpboot/rpm/sysklogd-1.4.1-22.x86_64.rpm -ihv --nodeps
<== OK torque-mom /tftpboot/rpm/torque-mom-1.2.0p5-2.x86_64.rpm -ihv --nodeps
<== OK module-init-tools /tftpboot/rpm/module-init-tools-3.1-0.pre5.3.x86_64.rpm -ihv --nodeps
<== OK cpio /tftpboot/rpm/cpio-2.5-7.x86_64.rpm -ihv --nodeps
<== t=3s
<== $? 0
4: Forced packages for i686: glibc
==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch' 'i686' 'glibc'
<== NG glibc /tftpboot/rpm/glibc-2.3.3-74.i686.rpm requires basesystem (UNKNOWN) requires glibc-common = 2.3.3-74 (UNKNOWN) requires libgcc (UNKNOWN)
<== t=1s
<== $? 1
5: Forced packages for i386: tcl libstdc++ freetype fontconfig glibc-devel xorg-x11-libs expat ncurses xorg-x11-Mesa-libGL zlib gpm libgcc
==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch' 'i386' 'tcl' 'libstdc++' 'freetype' 'fontconfig' 'glibc-devel' 'xorg-x11-libs' 'expat' 'ncurses' 'xorg-x11-Mesa-libGL' 'zlib' 'gpm' 'libgcc'
<== OK libgcc /tftpboot/rpm/libgcc-3.4.2-6.fc3.i386.rpm -ihv
<== NG expat /tftpboot/rpm/expat-1.95.7-4.i386.rpm requires libc.so.6(GLIBC_2.1) (UNKNOWN) requires libc.so.6(GLIBC_2.0) (UNKNOWN)
WORKAROUND:
I enabled debug mode with "export DEBUG_UPDATE_RPMS=1".
Strangely, the problem went away once after I ran "start_over" and then "install_cluster eth0" from a fresh login, but I could not reproduce this.
After that I always got the UNKNOWN basesystem error again. Is this package in tftpboot/rpm? Yes.
So, before step 4 "Build OSCAR Client Image", I edited /usr/lib/systeminstaller/SystemInstaller/Package/UpdateRPMs.pm and changed the code after line 99 to:
for my $farch (keys %{$forced}) {
    $cmd = "update-rpms --root=none --cache=u --list ";
The default option is --check, which is what the x86_64 RPMs use. I changed it to --list for the forced i386 and i686 packages.
---------------------------------------------------------------------------------------
4 - STEP 6 (Client installation): rsync hangs (pfilter must be off before network deployment)
---------------------------------------------------------------------------------------
rsync hung during the client installation, so I checked /var/log/systemimager/rsyncd:
[EMAIL PROTECTED] ~]# cat /var/log/systemimager/rsyncd
2006/03/28 04:21:09 [9409] rsyncd version 2.6.3 starting, listening on port 873
2006/03/28 04:36:24 [12692] rsync on boot/x86_64/standard/boel_binaries.tar.gz from node1.cluster.ird.nc (192.168.150.10)
2006/03/27 17:36:24 [12692] wrote 5003073 bytes read 122 bytes total size 5002338
2006/03/28 04:36:25 [12693] rsync on scripts/ from node1.cluster.ird.nc (192.168.150.10)
2006/03/27 17:36:25 [12693] wrote 22186 bytes read 191 bytes total size 21567
2006/03/28 04:37:12 [12698] rsync on editr-bunch1 from node1.cluster.ird.nc (192.168.150.10)
2006/03/28 04:37:13 [12698] wrote 1057372 bytes read 81 bytes total size 899587892
2006/03/28 04:37:13 [12700] rsync on editr-bunch1/ from node1.cluster.ird.nc (192.168.150.10)
2006/03/28 04:47:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:48:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:49:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:50:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:51:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:52:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
2006/03/28 04:52:43 [12700] rsync: writefd_unbuffered failed to write 69 bytes: phase "unknown" [sender]: Connection timed out (110)
2006/03/28 04:52:43 [12700] rsync error: error in rsync protocol data stream (code 12) at io.c(909)
WORKAROUND:
After some googling on oscar-users, I now stop the pfilter service before cluster deployment, because the firewall interferes with the rsync transfer.
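
The two parts of this workaround, stopping the firewall and confirming the symptom in the rsyncd log, can be sketched as follows. The function names are hypothetical. The sketch assumes that the init script is registered under the name pfilter, that the "service" wrapper is available on the head node, and that the log has one entry per line as rsyncd writes it.

```shell
#!/bin/sh
# Hypothetical helpers; the service name "pfilter" is an assumption.

# count_rsync_timeouts LOGFILE
# Print how many timeout errors (rsync code 30) the log contains.
count_rsync_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at zero.
    grep -c 'rsync error: timeout in data send/receive' "$1" || true
}

# stop_pfilter
# Stop the packet filter before starting the network deployment.
# Must be run as root on the head node.
stop_pfilter() {
    service pfilter stop
}
```

Running "count_rsync_timeouts /var/log/systemimager/rsyncd" on the log above would report the six timeout lines. After "stop_pfilter", a new client installation should add no further timeout entries if the firewall was the cause.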
---------------------------------------------------------------------------------------
5 - STEP 8 (Test cluster setup): Ganglia test failed (cause: multicast default mode in gmond.conf)
---------------------------------------------------------------------------------------
The Ganglia setup test failed, so I checked ganglia.err:
[EMAIL PROTECTED] ganglia]# cat ganglia.err
Client nodes: node1.cluster.ird.nc node2.cluster.ird.nc node3.cluster.ird.nc node4.cluster.ird.nc
Match pattern: editr.cluster.ird.nc|node1.cluster.ird.nc|node2.cluster.ird.nc|node3.cluster.ird.nc|node4.cluster.ird.nc
Number of hosts matched: 1
Gstat output:
CLUSTER INFORMATION
       Name: EDITR Cluster
      Hosts: 1
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Tue Mar 28 19:54:16 2006

CLUSTER HOSTS
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

editr.cluster.ird.nc
    2 (    0/  175) [  0.10,  0.07,  0.02] [   2.0,   0.0,   0.9,  96.9] OFF

The number of nodes expected is different from the number of nodes detected.
Check to see if gmond is running on all your nodes and make sure that you are not having any network issues.
WORKAROUND:
After some googling on "Ganglia-general", I commented out all the multicast entries in gmond.conf on my compute nodes and on the master node, and added the master node IP as a unicast target, like this:
udp_send_channel {
#  mcast_join = 239.2.11.71
  host = 192.168.150.50
  port = 8649
}
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8649
#  bind = 239.2.11.71
}
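
After restarting gmond on every node with the unicast settings, the check that the OSCAR test performs can be repeated by hand: compare the host count reported by gstat with the number of nodes expected. A minimal sketch, assuming the gstat summary has the same layout as the ganglia.err output above; the function name ganglia_host_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; reads gstat output on standard input.

# ganglia_host_count
# Print the number from the "Hosts:" line of the gstat summary.
# "Gexec Hosts:" and "Dead Hosts:" are skipped because their first
# whitespace-separated field is not "Hosts:".
ganglia_host_count() {
    awk '$1 == "Hosts:" { print $2; exit }'
}
```

On this cluster, "gstat | ganglia_host_count" should print 5 (the master plus four compute nodes) once unicast is working, instead of the 1 shown in the failed test.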
------------------------------------------------------------------------------------------------
