On Sat, 2007-10-27 at 22:40 -0500, Carlo Marcelo Arenas Belon wrote:
> On Sat, Oct 27, 2007 at 09:22:31PM -0400, Andrew Rowland wrote:
> > Non-Working Node: x86 CLFS-1.0.0 with Linux 2.6.19.1 kernel.  Ganglia
> > built as:
> > 
> >     ./configure --prefix=/opt/ganglia --disable-gexec && make && make install
> 
> so you are using gcc 4.1.1 and glibc 2.4, with nothing of interest to report
> when compiling the included expat?

Yes.  The only deviation from the book is that I also built the Fortran
compiler, which required GMP and MPFR.  Nothing interesting in the expat
build.

> > > # gstat -a
> > 
> > 40 files(1.1M bytes) - /usr/src/sys-cluster/ganglia/ganglia-3.0.5
> > [EMAIL PROTECTED] for 6D17h30m $ gstat -a
> > CLUSTER INFORMATION
> >        Name: clusterfsck
> >       Hosts: 1
> 
> so the gmond for your x86 is dead even if it started.
> 
> > weibullone.weibullnet.net
> >     2 (    0/  201) [  2.59,  2.41,  2.10] [   0.0,  62.1,   1.4,  36.5,   0.0] OFF
> 
> why does the working box have a real FQDN defined while the broken one has a
> fake, non-standard one?  Can you define a good hostname in the weibullnet.net
> domain for the x86 and see if that helps?

No difference.  But, I would still expect gstat to work on the x86 when
I SSH into it and execute locally.  That's not the case though.

> > > > When I gstat -i 172.16.1.101 from the head node, I get the following and
> > > > the gmond daemon is killed on 172.16.1.101.
> > > > 
> > > >         gexec_cluster() XML_ParseBuffer() error at line 51:
> > > >         no element found
> > > 
> > > this means that gmond crashed because of a broken xml while using expat,
> > > can you paste the output of
> > > 
> > > # lsof -p `pidof gmond`
> > 
> > On the non-working node:
> > 
> > 55 files() - /home/users/weibullguy/lsof_4.78/lsof_4.78_src
> > [EMAIL PROTECTED] for 0h16m $ ./lsof -p `pidof gmond`
> > COMMAND   PID USER   FD   TYPE DEVICE    SIZE    NODE NAME
> > gmond   16485 root  cwd    DIR    3,3    4096       2 /
> > gmond   16485 root  rtd    DIR    3,3    4096       2 /
> > gmond   16485 root  txt    REG    3,3  634148  182171 /usr/sbin/gmond
> > gmond   16485 root  mem    REG    3,3  152573 1172777 /lib/libnss_files-2.4.so
> > gmond   16485 root  mem    REG    3,3 6608752 1172766 /lib/libc-2.4.so
> > gmond   16485 root  mem    REG    3,3  572389 1172773 /lib/libpthread-2.4.so
> > gmond   16485 root  mem    REG    3,3  404817 1172783 /lib/libnsl-2.4.so
> > gmond   16485 root  mem    REG    3,3  224864 1172774 /lib/libresolv-2.4.so
> > gmond   16485 root  mem    REG    3,3   99796 1172770 /lib/libdl-2.4.so
> > gmond   16485 root  mem    REG    3,3   54376 1172772 /lib/libcrypt-2.4.so
> > gmond   16485 root  mem    REG    3,3  565268 1172769 /lib/libm-2.4.so
> > gmond   16485 root  mem    REG    3,3  165063 1172778 /lib/librt-2.4.so
> > gmond   16485 root  mem    REG    3,3  549116 1172788 /lib/ld-2.4.so
> > gmond   16485 root    0r   CHR    1,3            1023 /dev/null
> > gmond   16485 root    1w   CHR    1,3            1023 /dev/null
> > gmond   16485 root    2w   CHR    1,3            1023 /dev/null
> > gmond   16485 root    3u  IPv4 241510             UDP 239.2.11.71:8649 
> > gmond   16485 root    4u  IPv4 241511             TCP *:8649 (LISTEN)
> > gmond   16485 root    5u  IPv4 241512             UDP legolas.clusterfsck:1027->239.2.11.71:8649
> 
> another thing of interest is that you are not using DNS for host name
> resolution (presumably not on the working box either).  see if adding "dns"
> to /etc/nsswitch.conf helps.

I am using dns.  The /etc/nsswitch.conf is identical on both machines.  

# Begin /etc/nsswitch.conf

passwd: files
group: files
shadow: files

hosts: dns files
networks: files

protocols: files
services: files
ethers: files
rpc: files

# End /etc/nsswitch.conf
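
To compare what the resolver actually returns on each node, something like
the following Python sketch may help.  It is only an illustration: the helper
names forward_lookup and reverse_lookup are made up here, and it assumes
Python's socket module hands the query to the C library resolver, which on
glibc systems should follow the hosts line above.

```python
import socket


def forward_lookup(name):
    """Return the sorted, de-duplicated addresses the system resolver gives for name."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The name did not resolve through any configured source.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def reverse_lookup(address):
    """Return the primary host name for address, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(address)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
```

Running forward_lookup("legolas.clusterfsck") and
reverse_lookup("172.16.1.101") on both machines would show whether they
agree on the name and address of the non-working node.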

> I can't reproduce the problem here (using names similar to yours and a
> similar configuration, but with glibc 2.6 on a gentoo 2007.0 x86).  Getting
> the output of the XML generated by gmond until it crashes (with a telnet to
> 8649) or a core dump of gmond could probably help further.

Here are the results of "telnet localhost 8649" on the non-working node,
which crashes the gmond daemon straight away:

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
   <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
      <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
      <!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
   <!ELEMENT GRID (CLUSTER | GRID | HOSTS | METRICS)*>
      <!ATTLIST GRID NAME CDATA #REQUIRED>
      <!ATTLIST GRID AUTHORITY CDATA #REQUIRED>
      <!ATTLIST GRID LOCALTIME CDATA #IMPLIED>
   <!ELEMENT CLUSTER (HOST | HOSTS | METRICS)*>
      <!ATTLIST CLUSTER NAME CDATA #REQUIRED>
      <!ATTLIST CLUSTER OWNER CDATA #IMPLIED>
      <!ATTLIST CLUSTER LATLONG CDATA #IMPLIED>
      <!ATTLIST CLUSTER URL CDATA #IMPLIED>
      <!ATTLIST CLUSTER LOCALTIME CDATA #REQUIRED>
   <!ELEMENT HOST (METRIC)*>
      <!ATTLIST HOST NAME CDATA #REQUIRED>
      <!ATTLIST HOST IP CDATA #REQUIRED>
      <!ATTLIST HOST LOCATION CDATA #IMPLIED>
      <!ATTLIST HOST REPORTED CDATA #REQUIRED>
      <!ATTLIST HOST TN CDATA #IMPLIED>
      <!ATTLIST HOST TMAX CDATA #IMPLIED>
      <!ATTLIST HOST DMAX CDATA #IMPLIED>
      <!ATTLIST HOST GMOND_STARTED CDATA #IMPLIED>
   <!ELEMENT METRIC EMPTY>
      <!ATTLIST METRIC NAME CDATA #REQUIRED>
      <!ATTLIST METRIC VAL CDATA #REQUIRED>
      <!ATTLIST METRIC TYPE (string | int8 | uint8 | int16 | uint16 | int32 | uint32 | float | double | timestamp) #REQUIRED>
      <!ATTLIST METRIC UNITS CDATA #IMPLIED>
      <!ATTLIST METRIC TN CDATA #IMPLIED>
      <!ATTLIST METRIC TMAX CDATA #IMPLIED>
      <!ATTLIST METRIC DMAX CDATA #IMPLIED>
      <!ATTLIST METRIC SLOPE (zero | positive | negative | both | unspecified) #IMPLIED>
      <!ATTLIST METRIC SOURCE (gmond | gmetric) #REQUIRED>
   <!ELEMENT HOSTS EMPTY>
      <!ATTLIST HOSTS UP CDATA #REQUIRED>
      <!ATTLIST HOSTS DOWN CDATA #REQUIRED>
      <!ATTLIST HOSTS SOURCE (gmond | gmetric | gmetad) #REQUIRED>
   <!ELEMENT METRICS EMPTY>
      <!ATTLIST METRICS NAME CDATA #REQUIRED>
      <!ATTLIST METRICS SUM CDATA #REQUIRED>
      <!ATTLIST METRICS NUM CDATA #REQUIRED>
      <!ATTLIST METRICS TYPE (string | int8 | uint8 | int16 | uint16 | int32 | uint32 | float | double | timestamp) #REQUIRED>
      <!ATTLIST METRICS UNITS CDATA #IMPLIED>
      <!ATTLIST METRICS SLOPE (zero | positive | negative | both | unspecified) #IMPLIED>
      <!ATTLIST METRICS SOURCE (gmond | gmetric) #REQUIRED>
]>
<GANGLIA_XML VERSION="3.0.5" SOURCE="gmond">
<CLUSTER NAME="clusterfsck" LOCALTIME="1193582978" OWNER="The ReliaFree Project" LATLONG="unspecified" URL="http://reliafree.sourceforge.net">
<HOST NAME="legolas.clusterfsck" IP="172.16.1.101" REPORTED="1193582961" TN="17" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1193582961">
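
For anyone who wants to capture the same thing without telnet, here is a
minimal Python sketch.  It assumes gmond is listening on TCP port 8649 as in
the lsof output; the function names fetch_gmond_xml and check_xml are
invented for illustration and are not part of Ganglia.  It reads whatever
gmond sends until the connection closes or resets, then feeds the bytes to
expat to see where parsing fails.

```python
import socket
import xml.parsers.expat


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read everything the daemon sends until the socket closes or errors out."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            try:
                data = sock.recv(4096)
            except OSError:
                # A reset or timeout (e.g. the daemon died); keep what we already have.
                break
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_xml(data):
    """Parse data with expat; return (ok, message) describing the outcome."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return False, "%s (line %d)" % (reason, err.lineno)
    return True, "well-formed"
```

Calling check_xml(fetch_gmond_xml("localhost")) on the non-working node
should report the line at which the document breaks off, which can then be
compared with the line number in the gstat error.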

> Carlo
-- 
Andrew "Weibullguy" Rowland
Reliability & Safety Engineer

[EMAIL PROTECTED]
http://webpages.charter.net/weibullguy
http://reliafree.sourceforge.net

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
