Hello fine people of the Ganglia community!

I recently joined the team at a company with a long-running Ganglia installation, set up many years ago by persons unknown to me and not documented in any manner that we can find. It's still working, but the server it's on is in a failing state with some bad RAM (we think), and we get missing bits of the graphs. Aside from this being annoying, we're a little bit afraid that if we powered the machine off, it might never come back on - as happens.

So, one of my tasks in recent weeks has been to rebuild our Nagios and Ganglia setups, and I'm running into a weird problem. I will explain it after a brief overview of how we use Ganglia. That usage isn't likely to change soon, for a number of reasons, though I have fielded some suggestions on how we might do things differently and why.

There are about a hundred machines, each running gmond configured to send data unicast to the collector. I've modified their configuration so that each currently sends data to both our existing host and the new host. I am receiving at least some data for all machines, but I am missing quite a bit of it, especially load_one from almost everything, resulting in lots of broken images where I'd like to see graphs.

These machines are split into clusters, and grids with clusters in them, and it's .. well .. that's how it is. It looks something like this:

SH Grid
  - Content Grid < (crawl, workflow - clusters)
  - Production Grid < (web, db, search, misc - clusters)
  - Dev QA (Cluster)
  - Corp Xen (Cluster)
  - Infrastructure (Cluster)

So, on the collector host, there are three gmetad processes running:

  gmetad: SH Grid
  gmetad: Content Grid
  gmetad: Production Grid

As well as numerous gmond processes:

  gmond: crawl
  gmond: workflow
  gmond: web
  gmond: db
  gmond: search
  gmond: misc
  gmond: dev/qa
  gmond: xen
  gmond: infra

The configuration is duplicated exactly from the existing, "working" host, by which I mean that I am actually using the same configuration files. I was using gmetad 3.1 with gmond 3.0. That should work and did not seem to be the problem, but I decided it wouldn't hurt to align the versions, so I am currently running both from 3.0.

I have a few problems with this new setup:

* Grids disappear and reappear sporadically - e.g. the Production Grid is often not on the page, and today when I click through to Production Grid it takes me directly to the web cluster, because it is apparently not aware of any other clusters.
* Weird things happen - I know this is vague, but I'll lead with an example: when I click "Dev QA", sometimes it is reported as part of Production Grid, other times as part of Content Grid, when in fact it is part of the top-level "SH Grid".

I'm sure there is other weirdness, but some of it may come into focus once I get past these overwhelming problems.

Thanks in advance for any help that any of you can offer!

--
Best,
Justin Alan Ryan - Linux System Administrator
Simply Hired, Inc.
2513 E. Charleston, Suite 200
Mountain View, CA 94043
fax - 650.254.9001
cell - 415.321.0476
email - [email protected]
http://www.SimplyHired.com

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

