Hi Michael:

Can you please post the gmond.conf of one of the hosts and of the collector (a diff from the stock config, or a link to pastebin.com, if it's too big)?
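If it's easier, a unified diff against the defaults works too. Something
like this should produce one, assuming gmond's -t flag (which should print
the compiled-in default config) and the usual /etc/ganglia location; adjust
the path to wherever your running config actually lives:

    # dump the built-in default config, then diff it against the running one
    gmond -t > /tmp/gmond.conf.stock
    diff -u /tmp/gmond.conf.stock /etc/ganglia/gmond.conf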
Also, did you set send_metadata_interval > 0?

Cheers,

Bernard

On Wed, Apr 27, 2011 at 12:31 PM, Michael Bravo <mike.br...@gmail.com> wrote:
> Now that I have observed the system for a few hours more, I think I can
> generalize a bit. As to the 'df -h' output: it is identical, save for a
> minimal difference in the space actually free/used.
>
> However, let me describe the setup in more detail.
>
> There are 5 hosts in one datacenter, which comprise the cluster being
> monitored and run gmond, and one host in another datacenter, which runs
> the web frontend and gmetad.
>
> Let's say those are host1-host5 and then host-web.
>
> The 5 hosts in question are just idling before being put under
> production load, so most of the metrics are near zero.
>
> host1 is the collector; the other 4 hosts report to it via unicast.
> host-web then polls it.
>
> host-web (gmetad) <---------> host1 (gmond)
>                                  ^---------host2 (gmond)
>                                  ^---------host3 (gmond)
>                                  ^---------host4 (gmond)
>                                  ^---------host5 (gmond)
>
> something like this.
>
> Now, during the day in this timezone, while some preproduction work was
> being done on hosts 1-5, all of them except the problematic host3 had
> all of the default metrics reported and graphed. That was when I first
> wrote to the list.
>
> However, now that it is close to midnight here and almost everyone has
> gone home, I find that the ONLY host that still has all of the default
> metrics is host1, the collector (which also listens), while the others
> have lost everything but the up/down state. Like this (physical view):
>
> host5
> Last heartbeat 10s
> cpu: 0.00G mem: 0.00G
>
> host4
> Last heartbeat 10s
> cpu: 0.00G mem: 0.00G
>
> host3
> Last heartbeat 1s
> cpu: 0.00G mem: 0.00G
>
> host2
> Last heartbeat 8s
> cpu: 1.95G (4) mem: 0.00G
>
> host1
> 0.14
> Last heartbeat 1s
> cpu: 1.95G (4) mem: 7.80G
>
> So, purely as speculation, I could attribute this metric loss to all
> metrics being under value_threshold, but... what about time_threshold?
> And why is the collector host holding onto its metrics while the others
> lost theirs but keep the heartbeats?
>
> I feel confused, which is probably an indicator that I am missing
> something obvious...
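To clarify how the two thresholds interact (from memory of the 3.1
defaults, so treat the exact numbers as approximate): value_threshold only
decides whether a change is large enough to trigger an early send, while
time_threshold is the maximum age before the metric is sent again anyway,
even if it has not changed at all. A stock collection group looks roughly
like this:

    collection_group {
      collect_every  = 20
      time_threshold = 90        # re-send at least every 90s, changed or not
      metric {
        name = "load_one"
        value_threshold = "1.0"  # or earlier, if it moves by more than this
      }
    }

So idling hosts should still re-send every metric at least once per
time_threshold, which makes me suspect the senders are fine and that it is
the metric metadata on the collector side that has been lost; hence the
send_metadata_interval question above.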
> On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
>> Hi Michael:
>>
>> You can try looking at the XML representation of the metric data from
>> each of your gmonds to figure out what's different between them. You
>> can accomplish this by doing:
>>
>> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>>
>> This should spit out all the metric data of all hosts gmond is aware of.
>>
>> What's the output of `df -h` on both systems, do they look different?
>>
>> Cheers,
>>
>> Bernard
>>
>> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>>> More precisely, some metrics seem to be collected and periodically
>>> sent, such as:
>>>
>>> metric 'disk_free' being collected now
>>> Counting device /dev/root (6.21 %)
>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>> metric 'disk_free' has value_threshold 1.000000
>>> metric 'part_max_used' being collected now
>>> Counting device /dev/root (6.21 %)
>>> For all disks: 142.835 GB total, 133.963 GB free for users.
>>> metric 'part_max_used' has value_threshold 1.000000
>>>
>>> and then (I think around time_threshold expiration):
>>>
>>> sent message 'disk_free' of length 52 with 0 errors
>>> sent message 'part_max_used' of length 52 with 0 errors
>>>
>>> Also, on startup all of these metrics seem to be prepared correctly:
>>>
>>> sending metadata for metric: disk_free
>>> sent message 'disk_free' of length 52 with 0 errors
>>> sending metadata for metric: part_max_used
>>> sent message 'part_max_used' of length 52 with 0 errors
>>>
>>> etc. and so on.
>>>
>>> But none of these metrics appear in the node report at the web
>>> frontend, as I listed in the original message.
>>>
>>> Where is the "Local disk: unknown" part coming from, then?
>>>
>>> What is most baffling is that this problem host is completely
>>> identical to the one next to it, which has zero problems.
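One way to narrow that down is to check whether those values ever make it
into the collector's XML at all. As far as I remember, the frontend builds
the "Local Disk" line from the disk_total and disk_free metrics, so if
either of those is missing for that host you get "Unknown". Something along
these lines against host1 (the collector) should tell you; substitute the
real name/IP of the problem host as it appears in the XML:

    # dump everything the collector currently knows about the cluster
    nc host1 8649 > /tmp/host1.xml
    # print only the problem host's section and pick out its disk metrics
    awk '/<HOST NAME="xx.xx.xx.171"/,/<\/HOST>/' /tmp/host1.xml | grep disk_

If disk_total and disk_free show up with sane values, the problem is on
the web/gmetad side; if they are missing, the packets are not making it to
(or not being accepted by) host1.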
>>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>>> I did try that, in non-daemonized mode; however, there weren't any
>>>> evident errors popping up (and there's a lot of information coming
>>>> out that way), so perhaps I need an idea of what to look for.
>>>>
>>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>>> Have you tried starting up gmond on the affected server with debug
>>>>> set to 10 in the gmond.conf? This may show some of the collection
>>>>> problems it's having more specifically....
>>>>>
>>>>> -RC
>>>>>
>>>>> Ron Cavallo
>>>>> Sr. Director, Infrastructure
>>>>> Saks Fifth Avenue / Saks Direct
>>>>> 12 East 49th Street
>>>>> New York, NY 10017
>>>>> 212-451-3807 (O)
>>>>> 212-940-5079 (fax)
>>>>> 646-315-0119 (C)
>>>>> www.saks.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>>> To: ganglia-general
>>>>> Subject: [Ganglia-general] two identical hosts, one is having
>>>>> trouble with gmond
>>>>>
>>>>> Hello,
>>>>>
>>>>> Here is a strange occurrence. I have two (in fact, more than two,
>>>>> but let's consider just a pair) identical servers running identical
>>>>> setups: identical OS, identical gmond with identical config files,
>>>>> identical disks, identical everything. However, one of those servers
>>>>> is perfectly fine, and the other has trouble reporting the default
>>>>> metrics.
>>>>>
>>>>> Here's what the "normal" one shows in node view:
>>>>>
>>>>> xx.xx.xx.172
>>>>>
>>>>> Location: Unknown
>>>>> Cluster local time: Wed Apr 27 19:05:32 2011
>>>>> Last heartbeat received 5 seconds ago.
>>>>> Uptime: 9 days, 9:22:38
>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>
>>>>> Hardware
>>>>> CPUs: 4 x 1.95 GHz
>>>>> Memory (RAM): 7.80 GB
>>>>> Local Disk: Using 16.532 of 142.835 GB
>>>>> Most Full Disk Partition: 11.6% used.
>>>>>
>>>>> Software
>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>> Booted: April 18, 2011, 9:42 am
>>>>> Uptime: 9 days, 9:22:38
>>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>>
>>>>> And here's what the "problem" one shows:
>>>>>
>>>>> xx.xx.xx.171
>>>>>
>>>>> Location: Unknown
>>>>> Cluster local time: Wed Apr 27 19:07:32 2011
>>>>> Last heartbeat received 10 seconds ago.
>>>>> Uptime: 9 days, 9:20:01
>>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>>
>>>>> Hardware
>>>>> CPUs: 4 x 1.95 GHz
>>>>> Memory (RAM): 7.80 GB
>>>>> Local Disk: Unknown
>>>>> Most Full Disk Partition: 6.2% used.
>>>>>
>>>>> Software
>>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>>> Booted: April 18, 2011, 9:47 am
>>>>> Uptime: 9 days, 9:20:01
>>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>>
>>>>> Both are running gmond 3.1.7 and talk to a third host which also
>>>>> runs gmond 3.1.7 (and which in turn is polled by the web frontend
>>>>> host running gmetad 3.1.7).
>>>>>
>>>>> At a glance, something is confusing gmond on the problem server, so
>>>>> it mismatches disk partitions, or something like that.
>>>>>
>>>>> As a result, the problem node does not report all of the default
>>>>> metrics, and those it does report are somewhat off-kilter, as you
>>>>> can see (unknown local disk?).
>>>>>
>>>>> Any idea what might be going wrong and/or how to pinpoint the
>>>>> problem?
>>>>>
>>>>> --
>>>>> Michael Bravo
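To close the loop on my send_metadata_interval question: with unicast, if
I remember right, a gmond only announces metric metadata at startup when
send_metadata_interval is 0, so if host1 has been restarted since the
other hosts came up it will silently ignore their value packets, and you
end up with exactly the heartbeat-only picture you describe. A quick test
is to restart gmond on host2-host5 and see whether their metrics come
back; the longer-term fix is a periodic metadata re-send in the globals
section of the senders, along these lines (30 is just a reasonable guess,
anything greater than 0 should do):

    globals {
      # ... keep your existing settings ...
      send_metadata_interval = 30   # seconds; 0 = announce metadata only at startup
    }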