Thanks for the quick reply...

On Sat, 2007-10-27 at 19:18 -0500, Carlo Marcelo Arenas Belon wrote:
> On Sat, Oct 27, 2007 at 04:42:00PM -0400, Andrew Rowland wrote:
> > I have just installed Ganglia-3.0.5.  Configured without gexec on both
> > machines and with gmetad on one, but not the other.  I am able to start
> > gmond and gmetad with no errors.  But I am having problems on one of my
> > machines with gmond.
> 
> which OS/arch/release? If Linux, which distribution, and if not using the
> distribution-provided packages, what options were used to build it?

"Head" Node (Yes, the one with gmetad): x86_64 with multilib CLFS-1.0.0
with Linux 2.6.19.1 kernel.  Ganglia built as:

        CC="gcc ${BUILD64}" PKG_CONFIG_PATH="${PKG_CONFIG_PATH64}" \
            ./configure --prefix=/opt/ganglia --libdir=/opt/ganglia/lib64 \
            --with-gmetad --disable-gexec && make && make install

Non-Working Node: x86 CLFS-1.0.0 with Linux 2.6.19.1 kernel.  Ganglia
built as:

        ./configure --prefix=/opt/ganglia --disable-gexec && make && make install

> > Issuing gstat on the head node gives the following:
> 
> you mean the head node is the one that has gmetad? Both are technically head
> nodes based on your configuration, as you are using multicast and enabling TCP
> (so that gmetad can poll them).
> 
> >     CLUSTER INFORMATION
> >             Name: clusterfsck
> >     Hosts: 1
> >     Gexec Hosts: 0
> >     Dead Hosts: 0
> >     Localtime: Sat Oct 27 16:34:15 2007
> > 
> >     There are no hosts running gexec at this time
> 
> what do you get if you run the command below? I suspect you will only see
> 1 host, which is the one you are polling.
> 
> # gstat -a

40 files(1.1M bytes) - /usr/src/sys-cluster/ganglia/ganglia-3.0.5
[EMAIL PROTECTED] for 6D17h30m $ gstat -a
CLUSTER INFORMATION
       Name: clusterfsck
      Hosts: 1
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Sat Oct 27 20:49:50 2007

CLUSTER HOSTS
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]

weibullone.weibullnet.net
    2 (    0/  201) [  2.59,  2.41,  2.10] [   0.0,  62.1,   1.4,  36.5,   0.0] OFF

> > When I gstat -i 172.16.1.101 from the head node, I get the following and
> > the gmond daemon is killed on 172.16.1.101.
> > 
> >     gexec_cluster() XML_ParseBuffer() error at line 51:
> >     no element found
> 
> this means that gmond crashed because of broken XML while using expat. Can
> you paste the output of
> 
> # lsof -p `pidof gmond`

On the non-working node:

55 files() - /home/users/weibullguy/lsof_4.78/lsof_4.78_src
[EMAIL PROTECTED] for 0h16m $ ./lsof -p `pidof gmond`
COMMAND   PID USER   FD   TYPE DEVICE    SIZE    NODE NAME
gmond   16485 root  cwd    DIR    3,3    4096       2 /
gmond   16485 root  rtd    DIR    3,3    4096       2 /
gmond   16485 root  txt    REG    3,3  634148  182171 /usr/sbin/gmond
gmond   16485 root  mem    REG    3,3  152573 1172777 /lib/libnss_files-2.4.so
gmond   16485 root  mem    REG    3,3 6608752 1172766 /lib/libc-2.4.so
gmond   16485 root  mem    REG    3,3  572389 1172773 /lib/libpthread-2.4.so
gmond   16485 root  mem    REG    3,3  404817 1172783 /lib/libnsl-2.4.so
gmond   16485 root  mem    REG    3,3  224864 1172774 /lib/libresolv-2.4.so
gmond   16485 root  mem    REG    3,3   99796 1172770 /lib/libdl-2.4.so
gmond   16485 root  mem    REG    3,3   54376 1172772 /lib/libcrypt-2.4.so
gmond   16485 root  mem    REG    3,3  565268 1172769 /lib/libm-2.4.so
gmond   16485 root  mem    REG    3,3  165063 1172778 /lib/librt-2.4.so
gmond   16485 root  mem    REG    3,3  549116 1172788 /lib/ld-2.4.so
gmond   16485 root    0r   CHR    1,3            1023 /dev/null
gmond   16485 root    1w   CHR    1,3            1023 /dev/null
gmond   16485 root    2w   CHR    1,3            1023 /dev/null
gmond   16485 root    3u  IPv4 241510             UDP 239.2.11.71:8649 
gmond   16485 root    4u  IPv4 241511             TCP *:8649 (LISTEN)
gmond   16485 root    5u  IPv4 241512             UDP legolas.clusterfsck:1027->239.2.11.71:8649
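
For anyone who wants to reproduce the parse error without gstat, the check
can be sketched in Python. This is only a sketch under assumptions: it
assumes gmond writes its XML dump to any client that connects to TCP port
8649 (the listening socket in the lsof output above) and then closes the
connection. The function names fetch_gmond_xml and check_xml are made up
for illustration. Python's xml.parsers.expat wraps the expat library, so a
dump that is cut off early should be reported as "no element found", the
same message gstat printed.

```python
import socket
import xml.parsers.expat


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read whatever the peer sends on the TCP port until it closes the socket.

    Assumption: gmond dumps its XML on connect and then closes.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def check_xml(payload):
    """Return None if payload is well-formed XML, else 'line N: <expat error>'."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Second argument True marks this as the final (only) chunk.
        parser.Parse(payload, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return "line %d: %s" % (err.lineno, message)
    return None
```

From the head node, print(check_xml(fetch_gmond_xml("172.16.1.101")))
would show either None for a complete dump or the line at which expat gave
up. Since gstat -i currently kills gmond on that node, this probe may well
do the same, so it is best run with gmond in the foreground on the
non-working node.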

> Carlo
-- 
Andrew "Weibullguy" Rowland
Reliability & Safety Engineer

[EMAIL PROTECTED]
http://webpages.charter.net/weibullguy
http://reliafree.sourceforge.net

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
