From: Lefevre Jerome [mailto:[EMAIL PROTECTED]]
Sent: Tue 28/03/2006 19:51
To: Bernard Li; [email protected]
Subject: Re: [Oscar-users] RE: 4.2.1b54423 : REPORT on Fedora Core 3 X86_64 and Ganglia is ON
At 19:25 28/03/2006 -0800, Bernard Li wrote:
Sorry, but I didn't make a backup of oscar_install.log before running
start_over the first time. Please find the oscar_install.log for my last
OSCAR installation.
>Hey Jerome:
>
>The apr package is listed in the "base" package's config.xml and the
>libaio package is listed in the "lam" package's config.xml - these RPMs
>should be installed automatically by the system and you do not need to
>install them manually.
>
>Can you post your full oscar_install.log so that we can find out what went
>wrong during the installation process?
>
>Cheers,
>
>Bernard
>
>
>----------
>From: Lefevre Jerome [mailto:[EMAIL PROTECTED]]
>Sent: Tue 28/03/2006 17:25
>To: Bernard Li; [email protected]
>Subject: 4.2.1b54423 : REPORT on Fedora Core 3 X86_64 and Ganglia is ON
>
>
>Cluster 5 Dual-Opteron tyan S2885 3Ware Sata raid
>Switch 3Com gigabit
>Fedora Core 3 x86_64 fresh install
>Oscar 4.2.1b54423
>-------------
>29-march-2006
>
>
>
>OSCAR 4.2.1b54423
>
>
>------------------------------------------ INSTALLATION REPORT ------------------------------------------------
>
>My OSCAR cluster install is finally up, but below you will find some
>errors and the workarounds I used.
>
>Below is a summary:
>
>1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3 dependent?)
>2 - STEP 3 (OSCAR server packages) eth0 not up (not OSCAR dependent)
>3 - STEP 4 (Build OSCAR client image) Forced packages for i686/i386 failed
>    (some dependencies are UNKNOWN)
>4 - STEP 6 (CLIENT Installation) rsync hang (pfilter must be off before
>    network deployment)
>5 - STEP 8 (TEST CLUSTER SETUP) Ganglia test failed (cause: multicast is the
>    default mode in gmond.conf; it should be unicast on my cluster and
>    switch device)
>
>
>Below are the OSCAR log output and my workarounds.
>
>Many thanks, and enjoy OSCAR!!
>
>
>
>./install_cluster eth0
>...
>---------------------------------------------------------------------------------------
>1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3 dependent?)
>---------------------------------------------------------------------------------------
>--> Returning oscar_server packages for apitest: elementtree Twisted apitest-profiled apitest
>--> Returning oscar_server packages for perl-Qt: perl-Qt
>--> Returning oscar_server packages for sync_files: crontabs sync_files
>--> Installing server core RPMs
>warning: /tftpboot/rpm/apr-util-0.9.4-17.x86_64.rpm: V3 DSA signature: NOKEY, key ID 4f2a6fd2
>error: Failed dependencies:
>        libapr-0.so.0()(64bit) is needed by apr-util-0.9.4-17.x86_64
>        libapr-0.so.0()(64bit) is needed by httpd-2.0.52-3.x86_64
>$pm->install ($dm->query_required_by ()) failed at ./wizard_prep line 312
>Couldn't install packages needed for OSCAR Wizard to run at ./wizard_prep line 312
>Oscar Wizard preparation script failed to complete at ./install_cluster line 212.
>
>
>WORKAROUND :
>In a shell, I typed: rpm -U /home/tftpboot/rpm/apr-0.9.4-23.x86_64.rpm
>
>
>
>./install_cluster eth0
>...
>---------------------------------------------------------------------------------------
>1 - STEP 3 (OSCAR server packages) Failed dependencies (Fedora Core 3 dependent?)
>---------------------------------------------------------------------------------------
>...
>--> Returning oscar_server packages for lam: lam-switcher-modulefile lam-oscar-modulefile lam-oscar libaio-devel libaio
>--> Installing server non-core RPMs (core RPMs already installed)
>warning: /tftpboot/rpm/libaio-devel-0.3.102-1.x86_64.rpm: V3 DSA signature: NOKEY, key ID 4f2a6fd2
>error: Failed dependencies:
>        libaio.so.1.0.0()(64bit) is needed by lam-oscar-7.0.6-3.x86_64
>$pm->install ($dm->query_required_by ()) failed at ./install_server line 73
>Couldn't install the required packages needed for OSCAR at ./install_server line 73
>--> Step 3: Failed to properly install OSCAR server; please check the logs
>
>WORKAROUND :
>In a shell, I typed: rpm -U /home/tftpboot/rpm/libaio-0.3.102-1.x86_64.rpm
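
Both failures above have the same shape: rpm prints "<capability> is needed by <package>" and the installer aborts. A minimal shell sketch for pulling those pairs out of a saved log follows. The function name missing_deps is hypothetical, and it assumes each dependency message sits on one line, as rpm prints it.

```shell
# List the capabilities rpm reported as missing.
# Reads log text on stdin; prints "capability<TAB>package-that-needs-it".
# missing_deps is a hypothetical helper name, not part of OSCAR.
missing_deps() {
    awk '{
        sub(/^[> \t]+/, "")                    # drop mail-quote marker and indentation
        i = index($0, " is needed by ")        # rpm failed-dependency message
        if (i > 0)
            print substr($0, 1, i - 1) "\t" substr($0, i + 14)
    }'
}
```

Usage: `missing_deps < oscar_install.log | sort -u`. To see which RPM in the package pool supplies a capability, `rpm -qp --provides <file>.rpm` can be run on a candidate file such as the apr or libaio RPMs named in the workarounds.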
>
>
>
>---------------------------------------------------------------------------------------
>2 - STEP 3 (OSCAR server packages) eth0 not up (not OSCAR dependent)
>---------------------------------------------------------------------------------------
>...
>Ganglia page is located at http://localhost/ganglia
>--> Successfully ran server non-core package post_server_install scripts
>--> Getting internal IP address
>Cannot update hosts without a valid ip.
> at ./install_server line 194
>        main::update_hosts('undef') called at ./install_server line 154
>--> Got: [IP ]
>--> Got: [broadcast ]
>--> Got: [netmask ]
>--> Adding hosts to /etc/hosts
>--> Step 3: Failed to properly install OSCAR server; please check the logs
>--> Update Wizard Env (as needed)
>Update environment: ENV{MANPATH}
>Update environment: ENV{PVM_RSH}
>Update environment: ENV{PVM_ROOT}
>Update environment: ENV{PVM_ARCH}
>Update environment: ENV{PATH}
>Update environment: ENV{_LMFILES_}
>Update environment: ENV{LOADEDMODULES}
>
>WORKAROUND : eth0 was not up and ifcfg-eth0 was wrong.
>I edited ifcfg-eth0 by hand:
>
>DEVICE=eth0
>BOOTPROTO=static
>TYPE=Ethernet
>IPADDR=192.168.150.50
>GATEWAY=192.168.150.253
>BROADCAST=192.168.150.255
>NETMASK=255.255.255.0
>NETWORK=192.168.150.0
>HWADDR=00:0e:0c:60:84:4f
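
Since the installer printed empty IP, broadcast and netmask values, a quick check of the interface file before rerunning install_cluster may save a round trip. This is a sketch only: check_ifcfg is a hypothetical helper, and the list of keys it treats as required is an assumption based on the file above, not something OSCAR documents.

```shell
# Report keys that are absent or empty in an ifcfg-style file.
# check_ifcfg and its key list are illustrative assumptions.
check_ifcfg() {
    _file=$1
    _missing=0
    for _key in DEVICE BOOTPROTO IPADDR NETMASK; do
        # require "KEY=" followed by at least one character
        if ! grep -q "^${_key}=..*" "$_file"; then
            echo "missing or empty: $_key"
            _missing=1
        fi
    done
    return "$_missing"
}
```

Usage: `check_ifcfg /etc/sysconfig/network-scripts/ifcfg-eth0 && ifup eth0`. The function returns non-zero if any of the keys is missing, so the interface is only brought up when the file looks complete.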
>
>
>
>---------------------------------------------------------------------------------------
>3 - STEP 4 (Build OSCAR client image) Forced packages for i686/i386 failed (some dependencies are UNKNOWN)
>---------------------------------------------------------------------------------------
>...
><== OK gpm /tftpboot/rpm/gpm-1.20.1-66.x86_64.rpm -ihv --nodeps
><== OK authconfig /tftpboot/rpm/authconfig-4.6.5-3.1.x86_64.rpm -ihv --nodeps
><== OK glibc-headers /tftpboot/rpm/glibc-headers-2.3.3-74.x86_64.rpm -ihv --nodeps
><== OK sysklogd /tftpboot/rpm/sysklogd-1.4.1-22.x86_64.rpm -ihv --nodeps
><== OK torque-mom /tftpboot/rpm/torque-mom-1.2.0p5-2.x86_64.rpm -ihv --nodeps
><== OK module-init-tools /tftpboot/rpm/module-init-tools-3.1-0.pre5.3.x86_64.rpm -ihv --nodeps
><== OK cpio /tftpboot/rpm/cpio-2.5-7.x86_64.rpm -ihv --nodeps
><== t=3s
><== $? 0
>4: Forced packages for i686: glibc
>==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch' 'i686' 'glibc'
><== NG glibc /tftpboot/rpm/glibc-2.3.3-74.i686.rpm requires basesystem (UNKNOWN) requires glibc-common = 2.3.3-74 (UNKNOWN) requires libgcc (UNKNOWN)
><== t=1s
><== $? 1
>5: Forced packages for i386: tcl libstdc++ freetype fontconfig glibc-devel xorg-x11-libs expat ncurses xorg-x11-Mesa-libGL zlib gpm libgcc
>==> /usr/bin/update-rpms '--root=none' '--cache=u' '--check' '--arch' 'i386' 'tcl' 'libstdc++' 'freetype' 'fontconfig' 'glibc-devel' 'xorg-x11-libs' 'expat' 'ncurses' 'xorg-x11-Mesa-libGL' 'zlib' 'gpm' 'libgcc'
><== OK libgcc /tftpboot/rpm/libgcc-3.4.2-6.fc3.i386.rpm -ihv
><== NG expat /tftpboot/rpm/expat-1.95.7-4.i386.rpm requires libc.so.6(GLIBC_2.1) (UNKNOWN) requires libc.so.6(GLIBC_2.0) (UNKNOWN)
>
>
>WORKAROUND :
>I enabled debug mode with "export DEBUG_UPDATE_RPMS=1".
>Very strangely, I saw no more trouble once after "start_over" and running
>"install_cluster eth0" from a fresh login, but this is not reproducible!
>I now always get the UNKNOWN basesystem error. Is this package in
>tftpboot/rpm? Yes.
>
>Before step 4 "Build OSCAR Client Image", in
>/usr/lib/systeminstaller/SystemInstaller/Package/UpdateRPMs.pm,
>I added after line 99:
>
>        for my $farch (keys %{$forced}) {
>                $cmd = "update-rpms --root=none --cache=u --list ";
>
>The --check option is the default for x86_64 RPMs. I changed it to --list
>for the forced i386 and i686 packages.
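
To see at a glance which requirements update-rpms could not resolve, the "NG" lines of the debug output can be split into one row per unresolved requirement. This is a rough sketch: unresolved_requires is a hypothetical helper, and it assumes each "<== NG ..." entry is on a single line in the format shown in the log above.

```shell
# Print "package<TAB>unresolved requirement" for every NG line on stdin.
# unresolved_requires is a hypothetical helper name.
unresolved_requires() {
    awk '{
        sub(/^[> \t]+/, "")                     # drop mail-quote marker and indentation
        if ($1 != "<==" || $2 != "NG") next     # only failed packages
        pkg = $3
        n = split($0, parts, " requires ")      # parts[2..n] are the requirements
        for (i = 2; i <= n; i++) {
            req = parts[i]
            sub(/ \(UNKNOWN\).*$/, "", req)     # strip the resolver status
            print pkg "\t" req
        }
    }'
}
```

Usage: capture the Step 4 output with DEBUG_UPDATE_RPMS=1 set, then run `unresolved_requires < step4.log` (the file name is only an example). For the glibc i686 entry above this lists basesystem, glibc-common = 2.3.3-74 and libgcc.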
>
>
>
>---------------------------------------------------------------------------------------
>4 - STEP 6 (CLIENT Installation) rsync hang (pfilter must be off before network deployment)
>---------------------------------------------------------------------------------------
>
>Rsync hang??
>
>I checked /var/log/systemimager/rsyncd:
>
>[EMAIL PROTECTED] ~]# cat /var/log/systemimager/rsyncd
>2006/03/28 04:21:09 [9409] rsyncd version 2.6.3 starting, listening on port 873
>2006/03/28 04:36:24 [12692] rsync on boot/x86_64/standard/boel_binaries.tar.gz from node1.cluster.ird.nc (192.168.150.10)
>2006/03/27 17:36:24 [12692] wrote 5003073 bytes read 122 bytes total size 5002338
>2006/03/28 04:36:25 [12693] rsync on scripts/ from node1.cluster.ird.nc (192.168.150.10)
>2006/03/27 17:36:25 [12693] wrote 22186 bytes read 191 bytes total size 21567
>2006/03/28 04:37:12 [12698] rsync on editr-bunch1 from node1.cluster.ird.nc (192.168.150.10)
>2006/03/28 04:37:13 [12698] wrote 1057372 bytes read 81 bytes total size 899587892
>2006/03/28 04:37:13 [12700] rsync on editr-bunch1/ from node1.cluster.ird.nc (192.168.150.10)
>2006/03/28 04:47:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:48:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:49:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:50:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:51:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:52:17 [12700] rsync error: timeout in data send/receive (code 30) at io.c(153)
>2006/03/28 04:52:43 [12700] rsync: writefd_unbuffered failed to write 69 bytes: phase "unknown" [sender]: Connection timed out (110)
>2006/03/28 04:52:43 [12700] rsync error: error in rsync protocol data stream (code 12) at io.c(909)
>
>WORKAROUND :
>After some searching on oscar-users, I stopped the pfilter service before
>cluster deployment, because the firewall interferes with it.
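
The repeated "timeout in data send/receive" entries are the visible symptom here, so counting them per rsync process is a quick way to confirm whether a deployment is stalling the same way. A minimal sketch follows. rsync_timeouts is a hypothetical helper, and it assumes the one-entry-per-line rsyncd log format shown above, where the third field is the bracketed PID.

```shell
# Count rsync timeout errors per daemon process.
# Reads an rsyncd log on stdin; prints "[pid]<TAB>count".
# rsync_timeouts is a hypothetical helper name.
rsync_timeouts() {
    awk '{
        sub(/^[> \t]+/, "")                     # drop mail-quote marker and indentation
        if ($0 ~ /rsync error: timeout in data send\/receive/)
            count[$3]++                         # $3 is the bracketed PID
    }
    END { for (pid in count) print pid "\t" count[pid] }'
}
```

Usage: `rsync_timeouts < /var/log/systemimager/rsyncd`. If a transfer shows up here, stopping the firewall on the server before redeploying (for example `service pfilter stop`, assuming the init script is named pfilter) matches the workaround above.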
>
>
>---------------------------------------------------------------------------------------
>5 - STEP 8 (TEST CLUSTER SETUP) Ganglia test failed (cause: multicast default mode in gmond.conf)
>---------------------------------------------------------------------------------------
>Ganglia Setup Test failed???
>
>I checked ganglia.err:
>
>[EMAIL PROTECTED] ganglia]# cat ganglia.err
>Client nodes: node1.cluster.ird.nc node2.cluster.ird.nc node3.cluster.ird.nc node4.cluster.ird.nc
>Match pattern: editr.cluster.ird.nc|node1.cluster.ird.nc|node2.cluster.ird.nc|node3.cluster.ird.nc|node4.cluster.ird.nc
>Number of hosts matched: 1
>Gstat output:
>CLUSTER INFORMATION
>       Name: EDITR Cluster
>      Hosts: 1
>Gexec Hosts: 0
> Dead Hosts: 0
>  Localtime: Tue Mar 28 19:54:16 2006
>
>CLUSTER HOSTS
>Hostname                     LOAD                       CPU              Gexec
> CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]
>
>editr.cluster.ird.nc
>    2 (    0/  175) [  0.10,  0.07,  0.02] [   2.0,   0.0,   0.9,  96.9] OFF
>
>The number of nodes expected is different from the number of nodes detected.
>Check to see if gmond is running on all your nodes and make sure that you
>are not having any network issues.
>
>
>WORKAROUND :
>After some searching on "Ganglia-general", I commented out all the
>multicast entries in gmond.conf on my compute nodes and master node, and
>added the master node IP, like this:
>
>udp_send_channel {
>  # mcast_join = 239.2.11.71
>  host = 192.168.150.50
>  port = 8649
>}
>
>udp_recv_channel {
>  # mcast_join = 239.2.11.71
>  port = 8649
>  # bind = 239.2.11.71
>}
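
Because the change has to be made on every node, it is easy to miss one. The sketch below checks whether a gmond.conf still has an uncommented mcast_join entry. has_active_mcast is a hypothetical helper name, and the check covers only mcast_join, the multicast setting that appears in both channels above.

```shell
# Succeed (exit 0) if the given gmond.conf still has an active mcast_join line.
# has_active_mcast is a hypothetical helper name.
has_active_mcast() {
    # lines starting with "#" do not match, so commented-out entries are ignored
    grep -Eq '^[[:space:]]*mcast_join[[:space:]]*=' "$1"
}
```

Usage: `has_active_mcast /etc/gmond.conf && echo "multicast still enabled"`, run on each node (the path is an assumption and may differ on your install). gmond has to be restarted after editing the file for the unicast settings to take effect.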
>
>
>------------------------------------------------------------------------------------------------
>
