Thanks for the clarifications.

"

       (1) When a node receives a broadcast from another node that it
hasn't seen before, it may want to send its data back to the first
node.  If I start node A and it broadcasts to an empty cluster, then I
start node B and it broadcasts to A, then it might be nice if node A
sends data back to B because it can reasonably infer that B doesn't
have A's state and that B should have A's state.


I haven't checked the gmond sources lately, but this is exactly what it
was designed to do.  Anytime gmond sees data from a new node that it
hasn't seen before, it assumes the new node doesn't know anything about
this node either, and sends a complete set of its own metrics out on the
multicast address.  This can actually cause part of the problem,
especially if you restart gmond on a lot of nodes all at the same time,
basically because multicast is UDP-based and therefore does not have
guaranteed packet delivery.  I think during this burst of UDP metrics
from many nodes, some get lost and you will just have to wait until they
are resent later.
"

I have 4 nodes and am slowly starting them up one-by-one manually over a 
reasonably fast unloaded network, so UDP non-delivery is unlikely to be 
occurring.  I don't see this functionality in my initial glance at the code.  
I'll dig around some more.

Thanks, Chuck




Jason A. Smith wrote:

On Thu, 2006-03-23 at 15:47 -0800, Chuck Simmons wrote:
Alex --

Thanks for the details.  Telneting to a gmond XML port to dump
internal state is a nice debugging technique.

One of my problems is that I'm running a secondary daemon using the
gmetric subroutine libraries, and it took me a while to realize that
daemon is in some ways equivalent to 'gmond'.  In particular, I have
to restart it in addition to 'gmond'.  The problem was immediately
obvious once I used the telnet trick you mentioned.
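
For anyone following along, here is a minimal sketch of that trick in
script form.  It assumes gmond's default XML port of 8649 and that the
querying machine is allowed to connect; both are assumptions, adjust to
your setup:

  # Minimal sketch: dump a gmond's in-memory cluster state as XML.
  # Assumes the default XML port 8649 and that this host may connect.
  import socket

  def dump_gmond_xml(host="localhost", port=8649):
      chunks = []
      with socket.create_connection((host, port), timeout=10) as sock:
          while True:
              data = sock.recv(4096)
              if not data:      # gmond closes the connection when the dump is done
                  break
              chunks.append(data)
      return b"".join(chunks).decode("utf-8", errors="replace")

  if __name__ == "__main__":
      xml = dump_gmond_xml()
      # Quick check: which hosts and cpu_num values does this gmond know about?
      for line in xml.splitlines():
          if "<HOST " in line or "cpu_num" in line:
              print(line.strip())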

Metrics also have a dmax attribute that should force their removal from
memory once expired, but I don't remember if this is actually
implemented or not.
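
To illustrate what that would mean if it is implemented, here is a
conceptual sketch only, not a claim about the actual gmond code: a metric
carrying a dmax ("delete max") is dropped once nothing has been heard
about it for dmax seconds, with 0 meaning never delete.

  # Conceptual sketch only: how a per-metric dmax expiry could work.
  # Whether gmond actually does this is exactly the open question above.
  import time

  def expire_metrics(metrics, now=None):
      # Drop metrics whose dmax has elapsed since they were last heard.
      # A dmax of 0 means "never delete", mirroring the host_dmax
      # semantics quoted further down in this thread.
      now = time.time() if now is None else now
      return {name: m for name, m in metrics.items()
              if m["dmax"] == 0 or now - m["last_heard"] <= m["dmax"]}

  # A metric last heard 400s ago with dmax=300 would be dropped:
  stale = {"cpu_num": {"value": 4, "dmax": 300, "last_heard": time.time() - 400}}
  print(expire_metrics(stale))   # -> {}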

So for the missing cpu data issue...  Let me write down what's
happening, slowly, to make sure I understand.  I'm running gmond in
multicast mode on each node of each cluster to aggregate data, implying
that each node of the cluster eventually aggregates data about all other
nodes of the same cluster.  I'm using a centralized gmetad to pull data
from a node of each cluster.  Presumably 'gmetad' doesn't really remember
a whole lot about the outlying nodes.

I am not really sure what you mean here, but gmetad basically keeps info
about all nodes in each cluster in memory, similar to how gmond keeps
info about all nodes in its cluster in memory.  Just like gmond, gmetad
also respects the dmax attributes.  If you don't have dmax set or don't
want to wait that long then you will have to restart gmetad also.
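
For concreteness, a gmetad.conf for this kind of layout usually has one
data_source line per cluster; the hostnames and the 60-second polling
interval below are only placeholders:

  # /etc/gmetad.conf (illustrative).  gmetad asks the listed gmonds in
  # order and takes the cluster's state from the first one that answers,
  # so listing more than one node per cluster gives some failover.
  data_source "staiu"      60 staiu01:8649 staiu02:8649
  data_source "staqp05-08" 60 staqp05:8649 staqp06:8649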

   I go out to the cluster and kill gmond on each node.  Then I go
through the nodes and start gmond back up on each node.  As each node
starts, it broadcasts number of cpus throughout the cluster.  Thus,
when I'm done restarting, one of the nodes (the first to restart)
knows how many cpus each node has, but nodes that were restarted last
don't have complete state information.

Not exactly true, see below.

When I then restart 'gmetad' at the central location, it connects up
to one of the nodes in the cluster, and if that node doesn't have full
state information, gmetad incorrectly reports the number of cpus in the
cluster.  [Since I am using a background process that gathers metrics
separately from 'gmond' relatively frequently, this background process
is probably causing all nodes in the cluster to know about all of the
hosts in the cluster if not all of the metrics of all of the hosts in
the cluster.]
   This will eventually correct itself since all metrics are
periodically rebroadcast.
   Possible alternate fixes may include:
       (1) When a node receives a broadcast from another node that it
hasn't seen before, it may want to send its data back to the first
node.  If I start node A and it broadcasts to an empty cluster, then I
start node B and it broadcasts to A, then it might be nice if node A
sends data back to B because it can reasonably infer that B doesn't
have A's state and that B should have A's state.

I haven't checked the gmond sources lately, but this is exactly what it
was designed to do.  Anytime gmond sees data from a new node that it
hasn't seen before, it assumes the new node doesn't know anything about
this node either, and sends a complete set of its own metrics out on the
multicast address.  This can actually cause part of the problem,
especially if you restart gmond on a lot of nodes all at the same time,
basically because multicast is UDP-based and therefore does not have
guaranteed packet delivery.  I think during this burst of UDP metrics
from many nodes, some get lost and you will just have to wait until they
are resent later.
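
A rough sketch of that behavior (not the actual gmond C code, just the
shape of it):

  # Rough sketch of the behavior described above; not the actual gmond code.
  # On hearing from a host we've never seen, assume it knows nothing about
  # us either and re-send our complete metric set on the multicast channel.
  known_hosts = set()
  cluster_state = {}   # host -> {metric name -> value}

  def on_metric_received(sender, name, value, send_full_state):
      first_contact = sender not in known_hosts
      known_hosts.add(sender)
      cluster_state.setdefault(sender, {})[name] = value
      if first_contact:
          # This reply burst is UDP/multicast, so parts of it can be lost
          # when many nodes restart at once -- the failure mode described above.
          send_full_state()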

       (2) Maybe daemons that gather metrics should not directly
broadcast them throughout a cluster.  Instead the metrics should be
accumulated within a central daemon and then be broadcast.  (In other
words, treat 'gmond' as having two separate components:  a metrics
gathering component and a metric/cluster aggregation component.  Then
both the metrics component of 'gmond' and the metrics that I am
gathering should be handed to the aggregation component.)  [This is
probably not useful without also implementing (1) above.]
       (3)  Alex implies that there may be alternate ways to
configure a cluster without using multicasting which may handle some
or all aspects of this problem.

You can configure gmond to use unicast if you don't need or care about
the HA feature that multicast gives you.
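
For example, in the 3.x-style gmond.conf syntax (same style as the
collection_group snippet quoted further down), a unicast setup might look
roughly like this; the aggregator hostname and port are placeholders:

  # Illustrative only: every node sends to one designated aggregator
  # instead of joining a multicast group.
  udp_send_channel {
    host = aggregator.example.com
    port = 8649
  }

  # Only the aggregator needs to receive metrics and answer XML queries:
  udp_recv_channel {
    port = 8649
  }
  tcp_accept_channel {
    port = 8649
  }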

      [We can treat each node as maintaining a list of metrics and
their current values and broadcasting deltas to that list on a
periodic basis.  In the current system, it is possible to receive a
delta without having the background data to which the delta applies.
Multiple daemons each spitting out deltas to their own metrics is
compatible with the current model.  However, we may want to have all
the background data in a single list; we may also want each node to
know which metric gathering daemons exist so that we can better report
when one of the metric gathering daemons dies.]
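
A tiny sketch of the failure mode that bracketed note describes,
receiving a "delta" (an updated value) for a host whose background data
we never had (the host name below is illustrative):

  # Sketch of "delta without background data": a later-starting node hears
  # an updated load_one for a host, but never saw that host's one-time
  # metrics (like cpu_num), so its picture of the host stays incomplete.
  cluster_state = {}   # host -> {metric name -> value}

  def apply_update(host, name, value):
      cluster_state.setdefault(host, {})[name] = value

  apply_update("staiu03", "load_one", 0.42)
  print(cluster_state)   # knows load_one for staiu03, but not its cpu_num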

Moving on to the issue of correcting configuration problems.  While we
can say that having a timeout is the way to correct configuration
issues, this is not necessarily the best implementation.  Part of my
problem is that I have multiple daemons that gather and broadcast
metrics.  If we address parts of that as discussed above, then it
becomes easier to fix the broadcast address by just resetting a single
daemon.

There was a plan to provide a plugin architecture for writing custom
metrics in ganglia; I am not sure what happened to that, though.

   So, at the current time, we can configure the system in a couple
of ways.  We can configure the system so that a host is considered
removed from a cluster when the host has been down sufficiently long,
or we can manually remove the host from the cluster by restarting all
gmond daemons in the cluster.
   Possible alternate approaches might include providing a command
that could be sent to a 'gmond' daemon in a cluster to remove a host
from the cluster.  It may be that there already exist mechanisms to
restart all gmond daemons in a cluster, but this mechanism is not
integrated into ganglia.
So, thanks, I think I now understand what's going on.

Cheers, Chuck



Alex Balk wrote:
Hi Chuck,


See below...



Chuck Simmons wrote:

The number of cpus does get sorted out, but I don't believe that
restarting 'gmond' is a solution.  The problem occurs after restarting
a number of 'gmond' processes, and the problem is caused because
'gmond' is not reporting the information.  Does 'gmond' maintain a
timestamp on disk as to when it last reported the number of cpus and
insist on waiting sufficiently long to report again?  Does the
collective distributed memory of the system remember when the number
of cpus was last reported but not remember what the last reported
value was?  Is there any chance that anyone can give me hints to how
the code works without me having to read the code and reverse engineer
the intent?

The reporting interval for number of CPUs is defined within /etc/gmond.conf.
For example:

 collection_group {
   collect_once   = yes
   time_threshold = 1800
   metric {
    name = "cpu_num"
   }
 }

The above defines that the number of CPUs is collected once at the
startup of gmond and reported every 1800 seconds.
Your problem occurs because gmond doesn't save any data on disk, but
rather keeps it only in memory. This means that if you're using a single
gmond aggregator (in unicast mode) and that aggregator gets restarted, it
will not receive another report of the number of CPUs until 1800 seconds
have elapsed since the previous report.
The case of multicast is a more interesting one, since every node holds
data for all nodes on the multicast channel. The question here is
whether an update with a newer timestamp overrides all previous XML data
for the host. I don't think that's the case; it seems more likely that
only existing data is overwritten... but then, I don't use multicast, so
you may qualify this answer as throwing useless, obvious crap your way.

Generally speaking, there are 2 cases when a host reports a metric via
its send_channel:
1. When a time_threshold expires.
2. When a value_threshold is exceeded.
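
In gmond.conf terms (same syntax as the cpu_num snippet above; the
numbers here are arbitrary), that looks roughly like:

  collection_group {
    collect_every  = 20      # sample load_one every 20 seconds
    time_threshold = 90      # always report at least every 90 seconds
    metric {
      name            = "load_one"
      value_threshold = "1.0"   # also report early if the value moves by more than this
    }
  }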

You're welcome to read the code for more insight, but a simple telnet to
a predefined TCP channel would probably be quicker. You could just look
at the XML data and compare pre-update and post-update values (yes,
you'll need to take note of the timestamps - again, in the XML).

I understand that I can group nodes via /etc/gmond.conf.  The question
is, once I have screwed up the configuration, how do I recover from
that screw-up?  I have restarted various gmetad's and various
gmond's.  The grouping is still incorrect.  Exactly which gmetad's and
gmond's do I have to shut down, and when?  And, again, my real question
is about understanding how the code works -- how the distributed memory
works.

As far as I know, you cannot recover from a configuration error unless
you've made sure host_dmax was set to a fairly small, non-zero value.

From the docs:

  The host_dmax value is an integer with units in seconds. When set to
  zero (0), gmond will never delete a host from its list even when a
  remote host has stopped responding. If host_dmax is set to a positive
  number then gmond will flush a host after it has not heard from it for
  host_dmax seconds. By the way, dmax means ``delete max''.

This way, once a host's configuration is modified to point at a
different send channel, the aggregator(s) on its previous channel will
forget about its existence once host_dmax expires.
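
Concretely, the knob lives in the globals section of gmond.conf (the
3600 below is just an example value):

  globals {
    # Forget a host that has been silent for an hour instead of keeping
    # it forever; the default of 0 means "never delete".
    host_dmax = 3600
  }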

Personally, I don't use multicast for various reasons, the main one
actually being its main advantage - every node keeps data on the entire
cluster. While this provides for maximal high availability, it also has
a bigger memory footprint, especially when you have a few thousand
nodes.

I'd much rather be ignored than have people try to pawn off facile
answers on me.

I'd provide you with more information on a possible setup which balances
high availability with performance, but I wouldn't want to overflow you
with useless data any more than I've done so far.
Let me know if you'd like more information.

Cheers,
Alex

Cheers, Chuck



Bernard Li wrote:
Hi Chuck:

For the first issue - give it time; it should sort itself out.
Alternatively, you can find out which node is reporting incorrect
information, and restart gmond on it.

For the second issue, you can group nodes into different data_source
entries via the multicast port in /etc/gmond.conf.  Use the same port
number for the nodes you want belonging to the same group.

You'll need to restart gmetad and gmond for the new groupings to take
effect.
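
Roughly, in 3.x-style syntax, that means giving each group its own port
on the gmond side and a matching data_source line on the gmetad side
(the addresses, ports and hostnames below are illustrative):

  # gmond.conf on the staiu nodes:
  udp_send_channel {
    mcast_join = 239.2.11.71
    port       = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port       = 8649
  }
  tcp_accept_channel {
    port = 8649
  }

  # On the staqp05-08 nodes use a different port (say 8650) in the same
  # three sections, then point gmetad at one node from each group:
  data_source "staiu"      staiu01:8649
  data_source "staqp05-08" staqp05:8650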

Cheers,

Bernard

------------------------------------------------------------------------
From: [EMAIL PROTECTED] on behalf of Chuck Simmons
Sent: Wed 22/03/2006 17:54
To: [email protected]
Subject: [Ganglia-developers] reorganizing clusters

I need help understanding two things.

I currently have a grid.  One of the clusters in the grid is named
"staiu" and the "grid" level web page reports that this has 8 hosts
containing 4 cpus.  In actuality, this has 8 hosts each containing 4
cpus, but apparently the hosts are not reporting the current number of
cpus to the front end.  Why not?  I recently restarted gmond on each of
the 8 hosts.

Another cluster is named "staqp05-08" and the "grid" level web page
reports that this has 12 hosts. In actual fact, it only has 4 hosts. The extra 8 hosts are the 8 hosts of the 'staiu' cluster. On the
cluster level page for staqp05-08, the "choose a node" pull down menu
lists the 8 staiu hosts, and the "hosts up" number contains the staiu
hosts, and there are undrawn graphs for each of the staiu hosts in the
"load one" section.  What do I have to do so that the web pages or gmond
daemons or whatever won't think that the staqp cluster contains the
staiu hosts?  What is the specific mechanism that causes this
association to persist despite having shut down all staqp gmond daemons
and both the gmond and gmetad daemons on the web server, simultaneously,
and then starting up that collection of daemons?

Thanks, Chuck


