Re: [Ganglia-general] two identical hosts, one is having trouble with gmond

Bernard Li Wed, 27 Apr 2011 10:59:05 -0700

Hi Michael:

You can look at the XML representation of the metric data from
each of your gmonds to figure out what differs between them.  To do
that, run the following on each host (assuming you are using the
default gmond port of 8649):

    nc localhost 8649

This should print the metric data for every host that gmond is aware of.
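
If eyeballing two XML dumps gets tedious, the comparison can be
scripted.  What follows is only a rough sketch, not anything shipped
with Ganglia: it assumes Python 3 on a machine that can reach the gmond
port, the function names are made up for illustration, and it relies on
the dump containing HOST elements with METRIC children that carry NAME
and VAL attributes.  Whether hosts are keyed by hostname or by IP
depends on your setup.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond writes to any client that connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is done
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_by_host(xml_bytes):
    """Return {host NAME: {metric NAME: VAL string}} for every HOST element."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return result


def missing_metrics(metrics, good_host, bad_host):
    """List metric names the good host reports and the bad host does not."""
    return sorted(set(metrics[good_host]) - set(metrics[bad_host]))
```

For example, missing_metrics(metrics_by_host(fetch_gmond_xml()),
"good-host", "bad-host") would list the metrics that only the healthy
node reports, where the two host names are placeholders for whatever
appears in the NAME attributes of your dump.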

What's the output of `df -h` on both systems?  Do they look different?
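
If the listings are long, a small script can do the comparison.  This
is only a sketch under a few assumptions: it parses `df -Pk` rather
than `df -h` (the POSIX format keeps one line per filesystem, with
sizes in 1024-byte blocks), it expects the output from each host to
have been saved into a string, and the function names are hypothetical.
Used space is ignored because it will naturally differ between hosts;
only mount points and total sizes are compared.

```python
def parse_df(text):
    """Parse `df -Pk` output into {mount point: total size in KB}."""
    totals = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # not a regular filesystem line
        _filesystem, total_kb, _used, _avail, _capacity, mount = fields
        totals[mount] = int(total_kb)
    return totals


def diff_df(good_text, bad_text):
    """Return (mount, good_total, bad_total) for every mismatching mount."""
    good, bad = parse_df(good_text), parse_df(bad_text)
    report = []
    for mount in sorted(set(good) | set(bad)):
        if good.get(mount) != bad.get(mount):
            report.append((mount, good.get(mount), bad.get(mount)))
    return report
```

An empty result means both hosts see the same mount points with the
same sizes; a None in either column means that mount point exists on
only one of them.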

Cheers,

Bernard

On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
> More precisely, some metrics seem to be collected, and periodically
> sent, such as
>
>       metric 'disk_free' being collected now
> Counting device /dev/root (6.21 %)
> For all disks: 142.835 GB total, 133.963 GB free for users.
>        metric 'disk_free' has value_threshold 1.000000
>        metric 'part_max_used' being collected now
> Counting device /dev/root (6.21 %)
> For all disks: 142.835 GB total, 133.963 GB free for users.
>        metric 'part_max_used' has value_threshold 1.000000
>
>
> and then (I think around time_threshold expiration)
>
>        sent message 'disk_free' of length 52 with 0 errors
>        sent message 'part_max_used' of length 52 with 0 errors
>
> also, on startup all of these metrics seem to be prepared correctly:
>
>       sending metadata for metric: disk_free
>        sent message 'disk_free' of length 52 with 0 errors
>        sending metadata for metric: part_max_used
>        sent message 'part_max_used' of length 52 with 0 error
>
> and so on
>
> but none of these metrics appear in the node report at the web
> frontend, as I listed in my original message
>
> where does the "Local disk: unknown" part come from, then?
>
> what is most baffling is that this problem host is completely
> identical to the one next to it, which has zero problems
>
> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>> I did try that, in non-daemonized mode. However, no evident errors
>> popped up (and a lot of information comes up that way), so perhaps
>> I need an idea of what to look for.
>>
>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>> Have you tried starting up gmond on the affected server with debug set
>>> to 10 in the gmond.conf? This may show some of the collection problems
>>> it's having more specifically.
>>>
>>> -RC
>>>
>>>
>>> Ron Cavallo
>>> Sr. Director, Infrastructure
>>> Saks Fifth Avenue / Saks Direct
>>> 12 East 49th Street
>>> New York, NY 10017
>>> 212-451-3807 (O)
>>> 212-940-5079 (fax)
>>> 646-315-0119(C)
>>> www.saks.com
>>>
>>>
>>> -----Original Message-----
>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>> To: ganglia-general
>>> Subject: [Ganglia-general] two identical hosts,one is having trouble
>>> with gmond
>>>
>>> Hello,
>>>
>>> here is a strange occurrence. I have two (in fact, more than two, but
>>> let's consider just a pair) identical servers running identical setups
>>> - identical OS, identical gmond with identical config files, identical
>>> disks, identical everything. However, one of those servers is perfectly
>>> fine, and the other one has trouble reporting default metrics.
>>>
>>> Here's what the "normal" one shows in node view:
>>>
>>> xx.xx.xx.172
>>>
>>> Location: Unknown
>>> Cluster local time Wed Apr 27 19:05:32 2011
>>> Last heartbeat received 5 seconds ago.
>>> Uptime 9 days, 9:22:38
>>>
>>> Load:                   1m      5m      15m
>>>                         0.00    0.00    0.00
>>>
>>> CPU Utilization:        user    sys     idle
>>>                         0.1     0.2     99.7
>>>
>>> Hardware
>>> CPUs: 4 x 1.95 GHz
>>> Memory (RAM): 7.80 GB
>>> Local Disk: Using 16.532 of 142.835 GB
>>> Most Full Disk Partition: 11.6% used.
>>>
>>> Software
>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>> Booted: April 18, 2011, 9:42 am
>>> Uptime: 9 days, 9:22:38
>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>
>>>
>>> and here's what the "problem one" shows:
>>>
>>> xx.xx.xx.171
>>>
>>> Location: Unknown
>>> Cluster local time Wed Apr 27 19:07:32 2011
>>> Last heartbeat received 10 seconds ago.
>>> Uptime 9 days, 9:20:01
>>>
>>> Load:                   1m      5m      15m
>>>                         0.00    0.00    0.00
>>>
>>> CPU Utilization:        user    sys     idle
>>>                         0.1     0.2     99.7
>>>
>>> Hardware
>>> CPUs: 4 x 1.95 GHz
>>> Memory (RAM): 7.80 GB
>>> Local Disk: Unknown
>>> Most Full Disk Partition: 6.2% used.
>>>
>>> Software
>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>> Booted: April 18, 2011, 9:47 am
>>> Uptime: 9 days, 9:20:01
>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>
>>>
>>>
>>> both are running gmond 3.1.7 and talk to a third host which also runs
>>> gmond 3.1.7 (which is getting polled by the web frontend host with
>>> gmetad 3.1.7)
>>>
>>> at a glance, something is confusing gmond on the problem server, so
>>> it mismatches disk partitions, or something like that.
>>>
>>> as a result, the problem node does not report all of the default
>>> metrics, and those it does report are somewhat off-kilter, as you can
>>> see (unknown local disk?)
>>>
>>> Any idea what might be going wrong and/or how to pinpoint the problem?
>>>
>>> --
>>> Michael Bravo
>>>
>>> ------------------------------------------------------------------------
>>> ------
>>> WhatsUp Gold - Download Free Network Management Software The most
>>> intuitive, comprehensive, and cost-effective network management toolset
>>> available today.  Delivers lowest initial acquisition cost and overall
>>> TCO of any competing solution.
>>> http://p.sf.net/sfu/whatsupgold-sd
>>> _______________________________________________
>>> Ganglia-general mailing list
>>> Ganglia-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>>>
>>
>
