>>> On 9/12/2008 at 11:48 AM, in message <[EMAIL PROTECTED]>, "Bernard Li" <[EMAIL PROTECTED]> wrote:
> Hi all:
>
> On Fri, Sep 12, 2008 at 10:00 AM, Ofer Inbar <[EMAIL PROTECTED]> wrote:
>
>> I added a host to an existing cluster, and noticed the total number of
>> CPU cores for the cluster fluctuate, so I tried restarting all the
>> gmond's in the cluster... but that just made most of the CPU's appear
>> to disappear from Ganglia metrics altogether.
>>
>> I narrowed it down to this:
>> each gmond only reports cpu_num for nodes that restarted after it.
>>
>> If I restart gmond on node1, it reports cpu_num for itself only, even
>> though other gmonds in the cluster are reporting cpu_num for other
>> nodes. If I restart gmond on node2, node1 will now report cpu_num for
>> itself and for node2 ... but node2 has now "forgotten" cpu_num for all
>> other nodes except itself. And so on.
>>
>> It's a catch-22. I can't make them all see every node's metrics.
>>
>> Ganglia 3.1.0 on CentOS, using multicast only.
>
> Since I have a 3.0.7 installation handy, Cos suggested I do an experiment:
>
> 1) With Ganglia 3.1.1, I restart a gmond that is not listed in the
> data_source and I nc the host checking for the number of cpu_num lines
> in the XML stream. This number stays 1 until quite a while (maybe 10
> mins).
> 2) With Ganglia 3.0.7, I did the same test as above, and immediately
> after restarting gmond the number of cpu_num lines was already at 3,
> and quite quickly grows in a matter of minutes
>
> When I first tried 3.1.x, I always thought it odd that when I restart
> a gmond, I had to restart *all* the rest of the gmonds to get the
> right number of total cpus, I guess this confirms my suspicion.
>
> If this is indeed a bug/unwanted new behaviour, please discuss this in
> ganglia-developers.
>
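
For anyone repeating the check described in the quoted message (counting
the cpu_num entries in the XML that gmond serves), here is a minimal
Python sketch of one way to do it without eyeballing nc output. It
assumes gmond is listening on the default tcp_accept_channel port 8649
and that the XML uses HOST and METRIC elements with NAME and VAL
attributes. The function names fetch_gmond_xml and hosts_with_metric are
only illustrative and are not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond writes when a client connects over TCP.

    Port 8649 is assumed to be the tcp_accept_channel port; adjust it
    if your gmond.conf uses something else.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_with_metric(xml_bytes, metric="cpu_num"):
    """Map each host name to its value for `metric`, if it reports one."""
    root = ET.fromstring(xml_bytes)
    found = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                found[host.get("NAME")] = m.get("VAL")
    return found
```

Calling len(hosts_with_metric(fetch_gmond_xml("node1"))) right after
restarting a gmond, and again a few minutes later, should show whether
the count stays at 1 or grows.
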
I wonder whether this is an issue with the way the metadata for a metric
is being sent. What is distinctive about cpu_num is that it is a
collect_once metric. That means that if the data value is sent but one
of the gmonds in the cluster has not yet received the metadata, the
value may be ignored when the XML is written.

One test that would help validate this theory: set
send_metadata_interval (in the globals section of gmond.conf) to
something greater than zero, even in a multicast environment. Then run
your same test and see whether the problem still shows up or goes away.
If it goes away, we may have to rework how the metadata is requested
and sent in a multicast environment.

Brad

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general