On Sat, 2007-10-27 at 22:40 -0500, Carlo Marcelo Arenas Belon wrote:
> On Sat, Oct 27, 2007 at 09:22:31PM -0400, Andrew Rowland wrote:
> > Non-Working Node: x86 CLFS-1.0.0 with Linux 2.6.19.1 kernel. Ganglia
> > built as:
> >
> > ./configure --prefix=/opt/ganglia --disable-gexec && make && make
> > install
>
> so you are using gcc 4.1.1 and glibc 2.4 and nothing of interest to report
> when compiled the included expat?
Yes. The only deviation from the book is that I also build the Fortran
compiler, which required GMP and MPFR.
Nothing interesting in the expat build.
> > > # gstat -a
> >
> > 40 files(1.1M bytes) - /usr/src/sys-cluster/ganglia/ganglia-3.0.5
> > [EMAIL PROTECTED] for 6D17h30m $ gstat -a
> > CLUSTER INFORMATION
> > Name: clusterfsck
> > Hosts: 1
>
> so the gmond for your x86 is dead even if it started.
>
> > weibullone.weibullnet.net
> > 2 ( 0/ 201) [ 2.59, 2.41, 2.10] [ 0.0, 62.1, 1.4, 36.5, 0.0] OFF
>
> why does the working box have a real FQDN defined while the broken one has a
> fake, non-standard one? Can you define a good hostname in the weibullnet.net
> domain for the x86 and see if that helps?
No difference. I would still expect gstat to work on the x86 when
I SSH into it and run it locally, but it does not.
> > > > When I gstat -i 172.16.1.101 from the head node, I get the following and
> > > > the gmond daemon is killed on 172.16.1.101.
> > > >
> > > > gexec_cluster() XML_ParseBuffer() error at line 51:
> > > > no element found
> > >
> > > this means that gmond crashed because of a broken xml while using expat,
> > > can you paste the output of
> > >
> > > # lsof -p `pidof gmond`
> >
> > On the non-working node:
> >
> > 55 files() - /home/users/weibullguy/lsof_4.78/lsof_4.78_src
> > [EMAIL PROTECTED] for 0h16m $ ./lsof -p `pidof gmond`
> > COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> > gmond 16485 root cwd DIR 3,3 4096 2 /
> > gmond 16485 root rtd DIR 3,3 4096 2 /
> > gmond 16485 root txt REG 3,3 634148 182171 /usr/sbin/gmond
> > gmond 16485 root mem REG 3,3 152573 1172777 /lib/libnss_files-2.4.so
> > gmond 16485 root mem REG 3,3 6608752 1172766 /lib/libc-2.4.so
> > gmond 16485 root mem REG 3,3 572389 1172773 /lib/libpthread-2.4.so
> > gmond 16485 root mem REG 3,3 404817 1172783 /lib/libnsl-2.4.so
> > gmond 16485 root mem REG 3,3 224864 1172774 /lib/libresolv-2.4.so
> > gmond 16485 root mem REG 3,3 99796 1172770 /lib/libdl-2.4.so
> > gmond 16485 root mem REG 3,3 54376 1172772 /lib/libcrypt-2.4.so
> > gmond 16485 root mem REG 3,3 565268 1172769 /lib/libm-2.4.so
> > gmond 16485 root mem REG 3,3 165063 1172778 /lib/librt-2.4.so
> > gmond 16485 root mem REG 3,3 549116 1172788 /lib/ld-2.4.so
> > gmond 16485 root 0r CHR 1,3 1023 /dev/null
> > gmond 16485 root 1w CHR 1,3 1023 /dev/null
> > gmond 16485 root 2w CHR 1,3 1023 /dev/null
> > gmond 16485 root 3u IPv4 241510 UDP 239.2.11.71:8649
> > gmond 16485 root 4u IPv4 241511 TCP *:8649 (LISTEN)
> > gmond 16485 root 5u IPv4 241512 UDP legolas.clusterfsck:1027->239.2.11.71:8649
>
> another thing of interest is that you are not using DNS for host name
> resolution (presumably not on the working box either). See if adding "dns"
> to /etc/nsswitch.conf helps.
I am using dns. The /etc/nsswitch.conf is identical on both machines.
# Begin /etc/nsswitch.conf
passwd: files
group: files
shadow: files
hosts: dns files
networks: files
protocols: files
services: files
ethers: files
rpc: files
# End /etc/nsswitch.conf
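
For anyone comparing the two machines, here is a minimal Python sketch that
extracts the lookup order from the "hosts:" line of an nsswitch.conf like the
one above. The function name hosts_lookup_order is made up for illustration
and is not part of Ganglia or glibc. It only reads the configuration text and
does not perform any name lookups.

```python
import re


def hosts_lookup_order(nsswitch_text):
    """Return the sources on the 'hosts:' line, in the order listed.

    Hypothetical helper for illustration only. Bracketed action items
    such as [NOTFOUND=return] are dropped.
    """
    for line in nsswitch_text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            sources = line[len("hosts:"):]
            sources = re.sub(r"\[[^\]]*\]", " ", sources)
            return sources.split()
    return []
```

Fed the file shown above, it returns ["dns", "files"], which means DNS is
consulted before /etc/hosts. Running it against /etc/nsswitch.conf on both
nodes is a quick way to confirm the two really agree.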
> I can't reproduce the problem here (using similar names that you do and a
> similar configuration but with glibc 2.6 in a gentoo 2007.0 x86), getting the
> output of the XML generated by gmond until it crashes (with a telnet to 8649)
> or a core dump of gmond could probably help further.
Here is the output of "telnet localhost 8649" on the non-working node.
The connection crashes the gmond daemon straight away.
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
<!ELEMENT GRID (CLUSTER | GRID | HOSTS | METRICS)*>
<!ATTLIST GRID NAME CDATA #REQUIRED>
<!ATTLIST GRID AUTHORITY CDATA #REQUIRED>
<!ATTLIST GRID LOCALTIME CDATA #IMPLIED>
<!ELEMENT CLUSTER (HOST | HOSTS | METRICS)*>
<!ATTLIST CLUSTER NAME CDATA #REQUIRED>
<!ATTLIST CLUSTER OWNER CDATA #IMPLIED>
<!ATTLIST CLUSTER LATLONG CDATA #IMPLIED>
<!ATTLIST CLUSTER URL CDATA #IMPLIED>
<!ATTLIST CLUSTER LOCALTIME CDATA #REQUIRED>
<!ELEMENT HOST (METRIC)*>
<!ATTLIST HOST NAME CDATA #REQUIRED>
<!ATTLIST HOST IP CDATA #REQUIRED>
<!ATTLIST HOST LOCATION CDATA #IMPLIED>
<!ATTLIST HOST REPORTED CDATA #REQUIRED>
<!ATTLIST HOST TN CDATA #IMPLIED>
<!ATTLIST HOST TMAX CDATA #IMPLIED>
<!ATTLIST HOST DMAX CDATA #IMPLIED>
<!ATTLIST HOST GMOND_STARTED CDATA #IMPLIED>
<!ELEMENT METRIC EMPTY>
<!ATTLIST METRIC NAME CDATA #REQUIRED>
<!ATTLIST METRIC VAL CDATA #REQUIRED>
<!ATTLIST METRIC TYPE (string | int8 | uint8 | int16 | uint16 |
int32 | uint32 | float | double | timestamp) #REQUIRED>
<!ATTLIST METRIC UNITS CDATA #IMPLIED>
<!ATTLIST METRIC TN CDATA #IMPLIED>
<!ATTLIST METRIC TMAX CDATA #IMPLIED>
<!ATTLIST METRIC DMAX CDATA #IMPLIED>
<!ATTLIST METRIC SLOPE (zero | positive | negative | both |
unspecified) #IMPLIED>
<!ATTLIST METRIC SOURCE (gmond | gmetric) #REQUIRED>
<!ELEMENT HOSTS EMPTY>
<!ATTLIST HOSTS UP CDATA #REQUIRED>
<!ATTLIST HOSTS DOWN CDATA #REQUIRED>
<!ATTLIST HOSTS SOURCE (gmond | gmetric | gmetad) #REQUIRED>
<!ELEMENT METRICS EMPTY>
<!ATTLIST METRICS NAME CDATA #REQUIRED>
<!ATTLIST METRICS SUM CDATA #REQUIRED>
<!ATTLIST METRICS NUM CDATA #REQUIRED>
<!ATTLIST METRICS TYPE (string | int8 | uint8 | int16 | uint16 |
int32 | uint32 | float | double | timestamp) #REQUIRED>
<!ATTLIST METRICS UNITS CDATA #IMPLIED>
<!ATTLIST METRICS SLOPE (zero | positive | negative | both |
unspecified) #IMPLIED>
<!ATTLIST METRICS SOURCE (gmond | gmetric) #REQUIRED>
]>
<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
<CLUSTER NAME="clusterfsck" LOCALTIME="1193582978" OWNER="The ReliaFree
Project" LATLONG="unspecified" URL="http://reliafree.sourceforge.net">
<HOST NAME="legolas.clusterfsck" IP="172.16.1.101" REPORTED="1193582961"
TN="17" TMAX="20" DMAX="0" LOCATION="unspecified"
GMOND_STARTED="1193582961">
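
The capture above can also be scripted rather than done by hand with telnet.
The following is a minimal Python sketch, not anything shipped with Ganglia:
the function names fetch_gmond_xml and check_xml are made up for illustration,
and the host and port defaults simply mirror the telnet command used above.
It reads whatever gmond sends before the connection drops, then runs the data
through expat to report where parsing fails.

```python
import socket
import xml.parsers.expat


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read everything sent on the TCP port until the peer closes or dies.

    Hypothetical helper; host and port are assumptions taken from the
    telnet command in this thread.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            try:
                data = sock.recv(4096)
            except (socket.timeout, ConnectionResetError):
                break  # keep whatever arrived before the daemon went away
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_xml(data):
    """Return None if data is well-formed XML, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None
```

Calling check_xml(fetch_gmond_xml()) on the non-working node should give the
line at which the document stops. For output cut off after an opening tag, as
in the capture above, expat reports "no element found", the same message
gstat printed.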
> Carlo
--
Andrew "Weibullguy" Rowland
Reliability & Safety Engineer
[EMAIL PROTECTED]
http://webpages.charter.net/weibullguy
http://reliafree.sourceforge.net
_______________________________________________ Ganglia-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/ganglia-general

