Hi Michael:

Can you please post the gmond.conf of one of the hosts and of the
collector (a diff from the stock config, or a pastebin.com link, if it's too big)?

Also, did you set send_metadata_interval > 0?
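If memory serves, for unicast setups it should be greater than 0:
otherwise each gmond sends its metric metadata only once at startup, and
a collector restarted afterwards will silently drop the value packets it
has no metadata for. A minimal sketch of the relevant globals stanza
(the interval value is just an example):

  globals {
    /* resend metadata every 30 seconds; 0 = send only at startup */
    send_metadata_interval = 30
  }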

Cheers,

Bernard

On Wed, Apr 27, 2011 at 12:31 PM, Michael Bravo <mike.br...@gmail.com> wrote:
> Now that I have observed the system for a few hours more, I think I can
> generalize a bit. As for the 'df -h' output: it is identical on both
> hosts, save for a minimal difference in the space actually free/used.
>
> However, let me describe the setup in more detail.
>
> There are 5 hosts in one datacenter, which comprise the cluster being
> monitored and run gmond, and one in another, which runs web frontend
> and gmetad.
>
> Let's say those are host1-host5, and then host-web.
>
> The 5 hosts in question are just idling before being put under
> production load, and so most of the metrics are near zero.
>
> host1 is the collector - the other 4 hosts report via unicast to it.
> host-web then polls it.
>
> host-web (gmetad) <---------> host1 (gmond)
>                                 ^--------- host2 (gmond)
>                                 ^--------- host3 (gmond)
>                                 ^--------- host4 (gmond)
>                                 ^--------- host5 (gmond)
>
> Something like this.
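> Roughly, the channels are configured like this (a sketch from memory
> with the default names/port, not the actual files):
>
>   /* on host2..host5: unicast all metrics to the collector */
>   udp_send_channel {
>     host = host1
>     port = 8649
>   }
>
>   /* on host1: receive from the others, and answer gmetad's TCP polls */
>   udp_recv_channel {
>     port = 8649
>   }
>   tcp_accept_channel {
>     port = 8649
>   }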
>
> Now, during the day in this timezone, while some preproduction work
> was being done on hosts1-5, all of them but the problematic host3
> had all of the default metrics reported and graphed. That was when I
> first wrote to the list.
>
> However, now that it is close to midnight here and most everyone has
> gone home, I find that the ONLY host that has all of the default
> metrics is host1, the collector (which also listens), while the others
> have lost everything but their up/down state. Like this (physical view):
>
>
> host5
> Last heartbeat 10s
> cpu: 0.00G mem: 0.00G
>
> host4
> Last heartbeat 10s
> cpu: 0.00G mem: 0.00G
>
> host3
> Last heartbeat 1s
> cpu: 0.00G mem: 0.00G
>
> host2
> Last heartbeat 8s
> cpu: 1.95G (4) mem: 0.00G
>
> host1
> 0.14
> Last heartbeat 1s
> cpu: 1.95G (4) mem: 7.80G
>
>
> So, purely as speculation, I could attribute this metric loss to all
> the metrics being under value_threshold, but what about time_threshold?
> And why is the collector host holding onto its metrics while the others
> lost theirs but keep sending heartbeats?
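> For reference, the kind of stanza I mean in gmond.conf (numbers here
> are illustrative, not necessarily what I'm running):
>
>   collection_group {
>     collect_every = 40
>     time_threshold = 180   /* send at least this often, even if unchanged */
>     metric {
>       name = "load_one"
>       value_threshold = "1.0"   /* also send when the value moves this much */
>     }
>   }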
>
> I feel confused, which is probably an indicator that I am missing
> something obvious...
>
> On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
>> Hi Michael:
>>
>> You can try looking at the XML representation of the metric data from
>> each of your gmonds to figure out what's different between them.  You
>> can accomplish this by doing:
>>
>> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>>
>> This should spit out all the metric data of all hosts gmond is aware of.
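>> For example, to diff the metric names two gmonds report (assuming nc,
>> grep and diff are available):
>>
>>   nc host1 8649 | grep -o 'METRIC NAME="[^"]*"' | sort > host1-metrics.txt
>>   nc host3 8649 | grep -o 'METRIC NAME="[^"]*"' | sort > host3-metrics.txt
>>   diff host1-metrics.txt host3-metrics.txt
>>
>> (Note: the collector's dump includes every host's metrics, so expect
>> extra lines there.)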
>>
>> What's the output of `df -h` on both systems, do they look different?
>>
>> Cheers,
>>
>> Bernard
>>
>> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>>> More precisely, some metrics seem to be collected and periodically
>>> sent, such as:
>>>
>>>       metric 'disk_free' being collected now
>>> Counting device /dev/root (6.21 %)
>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>        metric 'disk_free' has value_threshold 1.000000
>>>        metric 'part_max_used' being collected now
>>> Counting device /dev/root (6.21 %)
>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>>        metric 'part_max_used' has value_threshold 1.000000
>>>
>>>
>>> and then (I think around time_threshold expiration)
>>>
>>>        sent message 'disk_free' of length 52 with 0 errors
>>>        sent message 'part_max_used' of length 52 with 0 errors
>>>
>>> Also, on startup, all of these metrics seem to be prepared correctly:
>>>
>>>       sending metadata for metric: disk_free
>>>        sent message 'disk_free' of length 52 with 0 errors
>>>        sending metadata for metric: part_max_used
>>>        sent message 'part_max_used' of length 52 with 0 errors
>>>
>>> and so on.
>>>
>>> But none of these metrics appear in the node report at the web
>>> frontend, as I listed in my original message.
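>>> I suppose I could also verify on the collector that the UDP packets
>>> actually arrive, with something like this (the interface name is a
>>> guess):
>>>
>>>   tcpdump -n -i eth0 udp port 8649 and host xx.xx.xx.171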
>>>
>>> Where is the "Local Disk: Unknown" part coming from, then?
>>>
>>> What is most baffling is that this problem host is completely
>>> identical to the one next to it, which has zero problems.
>>>
>>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>> I did try that, in non-daemonized mode; however, no evident errors
>>>> popped up (and there's a lot of information coming out that way), so
>>>> perhaps I need an idea of what to look for.
>>>>
>>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>>> Have you tried starting up gmond on the affected server with
>>>>> debug_level set to 10 in gmond.conf? This may show some of the
>>>>> collection problems it's having more specifically...
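>>>>> Or equivalently from the command line, in the foreground (the config
>>>>> path may differ on your install):
>>>>>
>>>>>   gmond -c /etc/ganglia/gmond.conf -d 10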
>>>>>
>>>>> -RC
>>>>>
>>>>>
>>>>> Ron Cavallo
>>>>> Sr. Director, Infrastructure
>>>>> Saks Fifth Avenue / Saks Direct
>>>>> 12 East 49th Street
>>>>> New York, NY 10017
>>>>> 212-451-3807 (O)
>>>>> 212-940-5079 (fax)
>>>>> 646-315-0119(C)
>>>>> www.saks.com
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>>> To: ganglia-general
>>>>> Subject: [Ganglia-general] two identical hosts, one is having trouble
>>>>> with gmond
>>>>>
>>>>> Hello,
>>>>>
>>>>> here is a strange occurrence. I have two (in fact, more than two, but
>>>>> let's consider just a pair) identical servers running identical setups:
>>>>> identical OS, identical gmond with identical config files, identical
>>>>> disks, identical everything. However, one of those servers is perfectly
>>>>> fine, and the other one has trouble reporting the default metrics.
>>>>>
>>>>> Here's what the "normal" one shows in node view:
>>>>>
>>>>> xx.xx.xx.172
>>>>>
>>>>> Location: Unknown
>>>>> Cluster local time Wed Apr 27 19:05:32 2011 Last heartbeat received 5
>>>>> seconds ago.
>>>>> Uptime 9 days, 9:22:38
>>>>> Load:   0.00    0.00    0.00
>>>>> 1m      5m      15m
>>>>>
>>>>> CPU Utilization:        0.1     0.2     99.7
>>>>> user    sys     idle
>>>>> Hardware
>>>>> CPUs: 4 x 1.95 GHz
>>>>> Memory (RAM): 7.80 GB
>>>>> Local Disk: Using 16.532 of 142.835 GB
>>>>> Most Full Disk Partition: 11.6% used.
>>>>> Software
>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>> Booted: April 18, 2011, 9:42 am
>>>>> Uptime: 9 days, 9:22:38
>>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>>
>>>>>
>>>>> and here's what the "problem one" shows:
>>>>>
>>>>> xx.xx.xx.171
>>>>>
>>>>> Location: Unknown
>>>>> Cluster local time Wed Apr 27 19:07:32 2011 Last heartbeat received 10
>>>>> seconds ago.
>>>>> Uptime 9 days, 9:20:01
>>>>> Load:   0.00    0.00    0.00
>>>>> 1m      5m      15m
>>>>>
>>>>> CPU Utilization:        0.1     0.2     99.7
>>>>> user    sys     idle
>>>>> Hardware
>>>>> CPUs: 4 x 1.95 GHz
>>>>> Memory (RAM): 7.80 GB
>>>>> Local Disk: Unknown
>>>>> Most Full Disk Partition: 6.2% used.
>>>>> Software
>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>> Booted: April 18, 2011, 9:47 am
>>>>> Uptime: 9 days, 9:20:01
>>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>>
>>>>>
>>>>>
>>>>> Both are running gmond 3.1.7 and talk to a third host, which also runs
>>>>> gmond 3.1.7 and is in turn polled by the web frontend host running
>>>>> gmetad 3.1.7.
>>>>>
>>>>> At a glance, something is confusing gmond on the problem server, making
>>>>> it mismatch disk partitions, or something along those lines.
>>>>>
>>>>> As a result, the problem node does not report all of the default
>>>>> metrics, and those it does report are somewhat off-kilter, as you can
>>>>> see ("Local Disk: Unknown"?).
>>>>>
>>>>> Any idea what might be going wrong and/or how to pinpoint the problem?
>>>>>
>>>>> --
>>>>> Michael Bravo