Now that I have observed the system for a few hours more, I can generalize a bit. As for the `df -h` output: it is identical on both hosts, save for a minimal difference in the space actually used/free.
However, let me describe the setup in more detail. There are five hosts in one datacenter, which comprise the monitored cluster and run gmond, and one host in another datacenter, which runs the web frontend and gmetad. Let's say those are host1-host5 and then host-web. The five hosts are just idling before being put under production load, so most of the metrics are near zero. host1 is the collector: the other four hosts report to it via unicast, and host-web then polls it.

host-web (gmetad) <---------> host1 (gmond)
                      ^--------- host2 (gmond)
                      ^--------- host3 (gmond)
                      ^--------- host4 (gmond)
                      ^--------- host5 (gmond)

Something like this.

Now, during the day in this timezone, while some preproduction work was being done on hosts1-5, all of them except the problematic host3 had all of the default metrics reported and graphed. That was when I first wrote to the list. However, now that it is close to midnight here and almost everyone has gone home, I find that the ONLY host that still has all of the default metrics is host1, the collector (which also listens), while the others have lost everything but their up/down state. Like this (physical view):

host5        Last heartbeat 10s   cpu: 0.00G       mem: 0.00G
host4        Last heartbeat 10s   cpu: 0.00G       mem: 0.00G
host3        Last heartbeat 1s    cpu: 0.00G       mem: 0.00G
host2        Last heartbeat 8s    cpu: 1.95G (4)   mem: 0.00G
host1  0.14  Last heartbeat 1s    cpu: 1.95G (4)   mem: 7.80G

So, out of pure speculation, I could attribute this metric loss to all the metrics being under value_threshold, but... what about time_threshold? And why does the collector host hold on to its metrics while the others lost theirs but keep sending heartbeats? I feel confused, which is probably an indicator that I am missing something obvious...

On Wed, Apr 27, 2011 at 9:56 PM, Bernard Li <bern...@vanhpc.org> wrote:
> Hi Michael:
>
> You can try looking at the XML representation of the metric data from
> each of your gmonds to figure out what's different between them.
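[Editor's note on the value_threshold vs. time_threshold question above: as I understand the gmond 3.1 configuration model, value_threshold only controls how large a change triggers an early resend, while time_threshold is the maximum interval before the metric is sent again regardless of its value, so idle near-zero values by themselves should not make a metric vanish for good. A sketch of the relevant config shape follows; the numbers are illustrative, not taken from the setup in this thread.]

```
# Sketch of a gmond 3.1-style collection group. The values are
# illustrative, not copied from the configuration discussed here.
collection_group {
  collect_every = 40     # sample the metric every 40 seconds
  time_threshold = 180   # send at least every 180 seconds, even if unchanged
  metric {
    name = "disk_free"
    value_threshold = "1.0"  # additionally resend early on a change >= 1.0
  }
}
```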
> You can accomplish this by doing:
>
> nc localhost 8649 (assuming you are using the default gmond port of 8649)
>
> This should spit out all the metric data of all hosts gmond is aware of.
>
> What's the output of `df -h` on both systems, do they look different?
>
> Cheers,
>
> Bernard
>
> On Wed, Apr 27, 2011 at 9:16 AM, Michael Bravo <mike.br...@gmail.com> wrote:
>> More precisely, some metrics seem to be collected, and periodically
>> sent, such as:
>>
>> metric 'disk_free' being collected now
>> Counting device /dev/root (6.21 %)
>> For all disks: 142.835 GB total, 133.963 GB free for users.
>> metric 'disk_free' has value_threshold 1.000000
>> metric 'part_max_used' being collected now
>> Counting device /dev/root (6.21 %)
>> For all disks: 142.835 GB total, 133.963 GB free for users.
>> metric 'part_max_used' has value_threshold 1.000000
>>
>> and then (I think around time_threshold expiration):
>>
>> sent message 'disk_free' of length 52 with 0 errors
>> sent message 'part_max_used' of length 52 with 0 errors
>>
>> Also, on startup all of these metrics seem to be prepared correctly:
>>
>> sending metadata for metric: disk_free
>> sent message 'disk_free' of length 52 with 0 errors
>> sending metadata for metric: part_max_used
>> sent message 'part_max_used' of length 52 with 0 errors
>>
>> etc. and so on.
>>
>> But none of these metrics appear in the node report at the web
>> frontend, as I listed in the original message.
>>
>> Where is the "Local Disk: Unknown" part coming from, then?
>>
>> What is most baffling is that this problem host is completely
>> identical to the one next to it, which has zero problems.
>>
>> On Wed, Apr 27, 2011 at 7:30 PM, Michael Bravo <mike.br...@gmail.com> wrote:
>>> I did try that, in non-daemonized mode, however there weren't any
>>> evident errors popping up (and there's a lot of information coming out
>>> that way), so perhaps I need an idea of what to look for.
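[Editor's note: Bernard's `nc` suggestion above can be scripted to compare what the collector actually holds for each host. A sketch follows; the XML sample is fabricated for illustration (real gmond 3.1 dumps carry more attributes), and against a live setup you would replace the heredoc with the output of `nc host1 8649`.]

```shell
# Count metrics per host in a gmond XML dump, to see which hosts the
# collector still has real data for. The sample below is a fabricated,
# trimmed-down stand-in for the real output of: nc host1 8649
cat <<'EOF' > /tmp/sample.xml
<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="test" LOCALTIME="1303926000" OWNER="" LATLONG="" URL="">
<HOST NAME="host1" IP="10.0.0.1" REPORTED="1303926000" TN="1" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0">
<METRIC NAME="load_one" VAL="0.14" TYPE="float" UNITS="" TN="5" TMAX="70" DMAX="0" SLOPE="both"/>
<METRIC NAME="disk_free" VAL="133.963" TYPE="double" UNITS="GB" TN="10" TMAX="180" DMAX="0" SLOPE="both"/>
</HOST>
<HOST NAME="host3" IP="10.0.0.3" REPORTED="1303926000" TN="1" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0">
<METRIC NAME="load_one" VAL="0.00" TYPE="float" UNITS="" TN="5" TMAX="70" DMAX="0" SLOPE="both"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF

# Per-host metric counts; a host that "lost" its metrics shows a low count.
awk -F'"' '/<HOST /{host=$2} /<METRIC /{count[host]++} END {for (h in count) print h, count[h]}' \
    /tmp/sample.xml | sort
# prints:
# host1 2
# host3 1
```

On a live dump it is also worth checking each METRIC's TN attribute (seconds since gmond last updated that value): metrics that are still listed but whose TN keeps climbing are stale rather than merely idle.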
>>>
>>> On Wed, Apr 27, 2011 at 7:24 PM, Ron Cavallo <ron_cava...@s5a.com> wrote:
>>>> Have you tried starting up gmond on the affected server with debug set to
>>>> 10 in the gmond.conf? This may show some of the collection problems it's
>>>> having more specifically...
>>>>
>>>> -RC
>>>>
>>>> Ron Cavallo
>>>> Sr. Director, Infrastructure
>>>> Saks Fifth Avenue / Saks Direct
>>>> 12 East 49th Street
>>>> New York, NY 10017
>>>> 212-451-3807 (O)
>>>> 212-940-5079 (fax)
>>>> 646-315-0119 (C)
>>>> www.saks.com
>>>>
>>>> -----Original Message-----
>>>> From: Michael Bravo [mailto:mike.br...@gmail.com]
>>>> Sent: Wednesday, April 27, 2011 11:14 AM
>>>> To: ganglia-general
>>>> Subject: [Ganglia-general] two identical hosts, one is having trouble
>>>> with gmond
>>>>
>>>> Hello,
>>>>
>>>> Here is a strange occurrence. I have two (in fact, more than two, but
>>>> let's consider just a pair) identical servers running identical setups:
>>>> identical OS, identical gmond with identical config files, identical
>>>> disks, identical everything. However, one of those servers is perfectly
>>>> fine, and the other has trouble reporting default metrics.
>>>>
>>>> Here's what the "normal" one shows in node view:
>>>>
>>>> xx.xx.xx.172
>>>>
>>>> Location: Unknown
>>>> Cluster local time: Wed Apr 27 19:05:32 2011
>>>> Last heartbeat received 5 seconds ago.
>>>> Uptime: 9 days, 9:22:38
>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>
>>>> Hardware
>>>> CPUs: 4 x 1.95 GHz
>>>> Memory (RAM): 7.80 GB
>>>> Local Disk: Using 16.532 of 142.835 GB
>>>> Most Full Disk Partition: 11.6% used.
>>>>
>>>> Software
>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>> Booted: April 18, 2011, 9:42 am
>>>> Uptime: 9 days, 9:22:38
>>>> Swap: Using 0.0 of 12001.6 MB swap.
>>>>
>>>> And here's what the "problem one" shows:
>>>>
>>>> xx.xx.xx.171
>>>>
>>>> Location: Unknown
>>>> Cluster local time: Wed Apr 27 19:07:32 2011
>>>> Last heartbeat received 10 seconds ago.
>>>> Uptime: 9 days, 9:20:01
>>>> Load: 0.00 0.00 0.00 (1m 5m 15m)
>>>> CPU Utilization: 0.1 user, 0.2 sys, 99.7 idle
>>>>
>>>> Hardware
>>>> CPUs: 4 x 1.95 GHz
>>>> Memory (RAM): 7.80 GB
>>>> Local Disk: Unknown
>>>> Most Full Disk Partition: 6.2% used.
>>>>
>>>> Software
>>>> OS: Linux 2.6.18-238.9.1.el5 (x86_64)
>>>> Booted: April 18, 2011, 9:47 am
>>>> Uptime: 9 days, 9:20:01
>>>> Swap: Using 12001.6 of 12001.6 MB swap.
>>>>
>>>> Both are running gmond 3.1.7 and talk to a third host which also runs
>>>> gmond 3.1.7 (which in turn is polled by the web frontend host running
>>>> gmetad 3.1.7).
>>>>
>>>> At a glance, something is confusing gmond on the problem server, so
>>>> it mismatches disk partitions, or something like that.
>>>>
>>>> As a result, the problem node does not report all of the default
>>>> metrics, and those it does report are somewhat off-kilter, as you can
>>>> see (unknown local disk?).
>>>>
>>>> Any idea what might be going wrong and/or how to pinpoint the problem?
>>>>
>>>> --
>>>> Michael Bravo
>>>>
>>>> ------------------------------------------------------------------------------
>>>> WhatsUp Gold - Download Free Network Management Software
>>>> The most intuitive, comprehensive, and cost-effective network
>>>> management toolset available today. Delivers lowest initial
>>>> acquisition cost and overall TCO of any competing solution.
>>>> http://p.sf.net/sfu/whatsupgold-sd
>>>> _______________________________________________
>>>> Ganglia-general mailing list
>>>> Ganglia-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/ganglia-general