Now that I have observed the system for a few hours more, I can generalize a bit. As for the `df -h` output: it is identical on both hosts, save for a minimal difference in the space actually used/free.
However, let me describe the setup in more detail. There are five hosts in one datacenter, which comprise the monitored cluster and run gmond, and one host in another datacenter, which runs the web frontend and gmetad. Let's say those are host1-host5 and then host-web. The five hosts are just idling before being put under production load, so most of the metrics are near zero. host1 is the collector: the other four hosts report to it via unicast, and host-web then polls it.

host-web (gmetad) <---------> host1 (gmond)
                      ^--------- host2 (gmond)
                      ^--------- host3 (gmond)
                      ^--------- host4 (gmond)
                      ^--------- host5 (gmond)

Something like this.

Now, during the day in this timezone, while some preproduction work was being done on hosts1-5, all of them except the problematic host3 had all of the default metrics reported and graphed. That was when I first wrote to the list. However, now that it is close to midnight here and almost everyone has gone home, I find that the ONLY host that still has all of the default metrics is host1, the collector (which also listens), while the others have lost everything but their up/down state. Like this (physical view):

host5        Last heartbeat 10s   cpu: 0.00G       mem: 0.00G
host4        Last heartbeat 10s   cpu: 0.00G       mem: 0.00G
host3        Last heartbeat 1s    cpu: 0.00G       mem: 0.00G
host2        Last heartbeat 8s    cpu: 1.95G (4)   mem: 0.00G
host1  0.14  Last heartbeat 1s    cpu: 1.95G (4)   mem: 7.80G

So, out of pure speculation, I could attribute this metric loss to all the metrics being under value_threshold, but... what about time_threshold? And why does the collector host hold on to its metrics while the others lost theirs but keep sending heartbeats? I feel confused, which is probably an indicator that I am missing something obvious...

On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
> Hi Michael:
>
> You can try looking at the XML representation of the metric data from
> each of your gmonds to figure out what's different between them.
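[Editor's note on the value_threshold vs. time_threshold question above: as I understand the gmond 3.1 configuration model, value_threshold only controls how large a change triggers an early resend, while time_threshold is the maximum interval before the metric is sent again regardless of its value, so idle near-zero values by themselves should not make a metric vanish for good. A sketch of the relevant config shape follows; the numbers are illustrative, not taken from the setup in this thread.]

```
# Sketch of a gmond 3.1-style collection group. The values are
# illustrative, not copied from the configuration discussed here.
collection_group {
  collect_every = 40     # sample the metric every 40 seconds
  time_threshold = 180   # send at least every 180 seconds, even if unchanged
  metric {
    name = "disk_free"
    value_threshold = "1.0"  # additionally resend early on a change >= 1.0
  }
}
```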
> You can accomplish this by doing:
>
> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>
> This should spit out all the metric data of all hosts gmond is aware of.
>
> What's the output of `df -h` on both systems, do they look different?
>
> Cheers,
>
> Bernard
>
> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>> More precisely, some metrics seem to be collected, and periodically
>> sent, such as:
>>
>> metric 'disk_free' being collected now
>> Counting device /dev/root (6.21 %)
>> For all disks: 142.835 GB total, 133.963 GB free for users.
>> metric 'disk_free' has value_threshold 1.000000
>> metric 'part_max_used' being collected now
>> Counting device /dev/root (6.21 %)
>> For all disks: 142.835 GB total, 133.963 GB free for users.
>> metric 'part_max_used' has value_threshold 1.000000
>>
>> and then (I think around time_threshold expiration):
>>
>> sent message 'disk_free' of length 52 with 0 errors
>> sent message 'part_max_used' of length 52 with 0 errors
>>
>> Also, on startup all of these metrics seem to be prepared correctly:
>>
>> sending metadata for metric: disk_free
>> sent message 'disk_free' of length 52 with 0 errors
>> sending metadata for metric: part_max_used
>> sent message 'part_max_used' of length 52 with 0 errors
>>
>> etc. and so on.
>>
>> But none of these metrics appear in the node report at the web
>> frontend, as I listed in the original message.
>>
>> Where is the "Local Disk: Unknown" part coming from, then?
>>
>> What is most baffling is that this problem host is completely
>> identical to the one next to it, which has zero problems.
>>
>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>> I did try that, in non-daemonized mode, however there weren't any
>>> evident errors popping up (and there's a lot of information coming out
>>> that way), so perhaps I need an idea of what to look for.
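[Editor's note: Bernard's `nc` suggestion above can be scripted to compare what the collector actually holds for each host. A sketch follows; the XML sample is fabricated for illustration (real gmond 3.1 dumps carry more attributes), and against a live setup you would replace the heredoc with the output of `nc host1 8649`.]

```shell
# Count metrics per host in a gmond XML dump, to see which hosts the
# collector still has real data for. The sample below is a fabricated,
# trimmed-down stand-in for the real output of: nc host1 8649
cat <<'EOF' > /tmp/sample.xml
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="test" LOCALTIME="1303926000" OWNER="" LATLONG="" URL="">
<HOST NAME="host1" IP="10.0.0.1" REPORTED="1303926000" TN="1" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0">
<METRIC NAME="load_one" VAL="0.14" TYPE="float" UNITS="" TN="5" TMAX="70" DMAX="0" SLOPE="both"/>
<METRIC NAME="disk_free" VAL="133.963" TYPE="double" UNITS="GB" TN="10" TMAX="180" DMAX="0" SLOPE="both"/>
</HOST>
<HOST NAME="host3" IP="10.0.0.3" REPORTED="1303926000" TN="1" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0">
<METRIC NAME="load_one" VAL="0.00" TYPE="float" UNITS="" TN="5" TMAX="70" DMAX="0" SLOPE="both"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF

# Per-host metric counts; a host that "lost" its metrics shows a low count.
awk -F'"' '/<HOST /{host=$2} /<METRIC /{count[host]++} END {for (h in count) print h, count[h]}' \
    /tmp/sample.xml | sort
# prints:
# host1 2
# host3 1
```

On a live dump it is also worth checking each METRIC's TN attribute (seconds since gmond last updated that value): metrics that are still listed but whose TN keeps climbing are stale rather than merely idle.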
>>>
>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>> Have you tried starting up gmond on the affected server with debug set to
>>>> 10 in the gmond.conf? This may show some of the collection problems it's
>>>> having more specifically...
>>>>
>>>> -RC
>>>>
>>>> Ron Cavallo
>>>> Sr. Director, Infrastructure
>>>> Saks Fifth Avenue / Saks Direct
>>>> 12 East 49th Street
>>>> New York, NY 10017
>>>> 212-451-3807 (O)
>>>> 212-940-5079 (fax)
>>>> 646-315-0119 (C)
>>>> www.saks.com
>>>>
>>>> -----Original Message-----
>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>> To: ganglia-general
>>>> Subject: [Ganglia-general] two identical hosts, one is having trouble
>>>> with gmond
>>>>
>>>> Hello,
>>>>
>>>> Here is a strange occurrence. I have two (in fact, more than two, but
>>>> let's consider just a pair) identical servers running identical setups:
>>>> identical OS, identical gmond with identical config files, identical
>>>> disks, identical everything. However, one of those servers is perfectly
>>>> fine, and the other has trouble reporting default metrics.
>>>>
>>>> Here's what the "normal" one shows in node view:
>>>>
>>>> xx.xx.xx.172
>>>>
>>>> Location: Unknown
>>>> Cluster local time: Wed Apr 27 19:05:32 2011
>>>> Last heartbeat received 5 seconds ago.
>>>> Uptime: 9 days, 9:22:38
>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>
>>>> Hardware
>>>> CPUs: 4 x 1.95 GHz
>>>> Memory (RAM): 7.80 GB
>>>> Local Disk: Using 16.532 of 142.835 GB
>>>> Most Full Disk Partition: 11.6% used.
>>>>
>>>> Software
>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>> Booted: April 18, 2011, 9:42 am
>>>> Uptime: 9 days, 9:22:38
>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>
>>>> And here's what the "problem one" shows:
>>>>
>>>> xx.xx.xx.171
>>>>
>>>> Location: Unknown
>>>> Cluster local time: Wed Apr 27 19:07:32 2011
>>>> Last heartbeat received 10 seconds ago.
>>>> Uptime: 9 days, 9:20:01
>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>
>>>> Hardware
>>>> CPUs: 4 x 1.95 GHz
>>>> Memory (RAM): 7.80 GB
>>>> Local Disk: Unknown
>>>> Most Full Disk Partition: 6.2% used.
>>>>
>>>> Software
>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>> Booted: April 18, 2011, 9:47 am
>>>> Uptime: 9 days, 9:20:01
>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>
>>>> Both are running gmond 3.1.7 and talk to a third host which also runs
>>>> gmond 3.1.7 (which in turn is polled by the web frontend host running
>>>> gmetad 3.1.7).
>>>>
>>>> At a glance, something is confusing gmond on the problem server, so
>>>> it mismatches disk partitions, or something like that.
>>>>
>>>> As a result, the problem node does not report all of the default
>>>> metrics, and those it does report are somewhat off-kilter, as you can
>>>> see (unknown local disk?).
>>>>
>>>> Any idea what might be going wrong and/or how to pinpoint the problem?
>>>>
>>>> --
>>>> Michael Bravo
>>>>
>>>> ------------------------------------------------------------------------------
>>>> WhatsUp Gold - Download Free Network Management Software
>>>> The most intuitive, comprehensive, and cost-effective network
>>>> management toolset available today. Delivers lowest initial
>>>> acquisition cost and overall TCO of any competing solution.
>>>> http://p.sf.net/sfu/whatsupgold-sd
>>>> _______________________________________________
>>>> Ganglia-general mailing list
>>>> Ganglia-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/ganglia-general