Re: [Ganglia-general] [Ganglia-developers] can't get cpu_num to show for whole cluster

Brad Nicholes Mon, 15 Sep 2008 14:15:53 -0700

>>> On 9/12/2008 at 11:48 AM, in message <[EMAIL PROTECTED]>,
"Bernard Li" <[EMAIL PROTECTED]> wrote:
> Hi all:
> 
> On Fri, Sep 12, 2008 at 10:00 AM, Ofer Inbar <[EMAIL PROTECTED]> wrote:
> 
>> I added a host to an existing cluster, and noticed the total number of
>> CPU cores for the cluster fluctuating, so I tried restarting all the
>> gmonds in the cluster... but that just made most of the CPUs appear
>> to disappear from Ganglia metrics altogether.
>>
>> I narrowed it down to this:
>>  each gmond only reports cpu_num for nodes that restarted after it.
>>
>> If I restart gmond on node1, it reports cpu_num for itself only, even
>> though other gmonds in the cluster are reporting cpu_num for other
>> nodes.  If I restart gmond on node2, node1 will now report cpu_num for
>> itself and for node2 ... but node2 has now "forgotten" cpu_num for all
>> other nodes except itself.  And so on.
>>
>> It's a catch-22.  I can't make them all see every node's metrics.
>>
>> Ganglia 3.1.0 on CentOS, using multicast only.
> 
> Since I have a 3.0.7 installation handy, Cos suggested I do an
> experiment:
> 
> 1) With Ganglia 3.1.1, I restart a gmond that is not listed in the
> data_source and I nc the host, checking the number of cpu_num lines
> in the XML stream.  This number stays at 1 for quite a while (maybe
> 10 mins).
> 2) With Ganglia 3.0.7, I did the same test as above, and immediately
> after restarting gmond the number of cpu_num lines was already at 3,
> and it grew quickly within a matter of minutes.
> 
> When I first tried 3.1.x, I always thought it odd that when I restarted
> a gmond, I had to restart *all* the rest of the gmonds to get the
> right number of total cpus.  I guess this confirms my suspicion.
> 
> If this is indeed a bug/unwanted new behaviour, please discuss this in
> ganglia-developers.
>
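
For anyone who wants to repeat the nc test quoted above without counting
lines by hand, here is a rough Python sketch.  It assumes gmond is
listening on the default tcp_accept_channel port (8649) and that the XML
has the usual HOST and METRIC elements with NAME and VAL attributes.  The
function names are my own, not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Read the XML dump gmond writes on its TCP accept channel.

    Port 8649 is the default; adjust it if your gmond.conf differs.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_reporting(xml_bytes, metric="cpu_num"):
    """Return {host name: value} for every HOST that carries the metric."""
    root = ET.fromstring(xml_bytes)
    found = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                found[host.get("NAME")] = m.get("VAL")
    return found
```

Calling hosts_reporting(fetch_gmond_xml("node1")) right after a restart,
and again every minute or so, should show how quickly that gmond learns
cpu_num for the other nodes ("node1" is just a placeholder host name).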


I am wondering if this might be an issue with the way that the metadata
for a metric is being sent.  What is distinctive here is that cpu_num is
a collect_once metric.  This means that if the data value is sent but
one of the gmonds in the cluster has not yet received the metadata, the
value may get ignored when the XML is written.  One interesting test to
validate this theory would be to set send_metadata_interval to something
greater than zero, even in a multicast environment.  Then run the same
test and see whether the problem still shows up.  If the problem goes
away, then we might have to rework how the metadata is being requested
and sent in a multicast environment.
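
For the test itself, the setting goes in the globals section of
gmond.conf.  A minimal fragment, where 30 seconds is only an arbitrary
example value and not a recommendation:

```
globals {
  /* Resend metric metadata every 30 seconds (example value only).
     With the default of 0, metadata is not resent periodically. */
  send_metadata_interval = 30
}
```

Keep the rest of your globals section as it is, and restart the gmonds
after the change so they pick it up.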

Brad
