A setup which "solves" the update issue while maintaining a level of HA
is to have 2 (or more) unicast send channels from each node to a pair
(or more) of gmond aggregators and to have a multicast channel setup
between the aggregators themselves.
The cost is more network traffic, but that's pretty insignificant
anyway, even on a 100Mb/s wire.
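Roughly, it would look something like this in each node's gmond.conf
(the hostnames and the multicast address below are just placeholders,
so adjust them for your environment):

  udp_send_channel {
    host = aggregator1.example.com
    port = 8649
  }
  udp_send_channel {
    host = aggregator2.example.com
    port = 8649
  }

and on the aggregators themselves, a plain receive channel for the
unicast traffic plus a multicast send/receive pair so they keep each
other up to date:

  udp_recv_channel {
    port = 8649
  }
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
    bind = 239.2.11.71
  }

I haven't tested this exact snippet, so treat it as a sketch rather
than a drop-in config.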
As for resending data whenever a new node appears on the multicast
channel, I haven't looked at the code (it's far too late at night to do
that now), but I hope that wasn't implemented the way you describe...
Think of a cluster with a few thousand nodes on one channel (probably
not the best idea) - a new node shows up and everyone starts coughing
their data onto the wire. Add a really low dmax value and a node
rebooting every few minutes (or a bad wire/switch/NIC) and you have a
lovely little mess.
Alex
Jason A. Smith wrote:
> On Thu, 2006-03-23 at 15:47 -0800, Chuck Simmons wrote:
>
>> Alex --
>>
>> Thanks for the details. Telneting to a gmond XML port to dump
>> internal state is a nice debugging technique.
>>
>> One of my problems is that I'm running a secondary daemon using the
>> gmetric subroutine libraries, and it took me a while to realize that
>> daemon is in some ways equivalent to 'gmond'. In particular, I have
>> to reboot it in addition to 'gmond'. The problem was immediately
>> obvious once I used the telnet trick you mentioned.
>>
>
> Metrics also have a dmax attribute that should force their removal from
> memory once expired, but I don't remember if this is actually
> implemented or not.
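> For what it's worth, if you are injecting metrics yourself, gmetric
> lets you set tmax and dmax on the command line. Something along these
> lines (the flag names are from memory, so check gmetric --help):
>
>   gmetric --name=my_metric --value=42 --type=uint32 --units=widgets \
>           --tmax=60 --dmax=300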
>
>
>> So for the missing cpu data issue... Let me write down what's
>> happening real slowly to make sure I understand. I'm running a
>> multicast gmond on each cluster to aggregate data, implying that each
>> node of the cluster eventually aggregates data about all other nodes
>> of the same cluster. I'm using a centralized gmetad to pull data from
>> a node of each cluster. Presumably 'gmetad' doesn't really remember a
>> whole lot about the outlying nodes.
>>
>
> I am not really sure what you mean here, but gmetad basically keeps info
> about all nodes in each cluster in memory, similar to how gmond keeps
> info about all nodes in its cluster in memory. Just like gmond, gmetad
> also respects the dmax attributes. If you don't have dmax set or don't
> want to wait that long then you will have to restart gmetad also.
>
>
>> I go out to the cluster and kill gmond on each node. Then I go
>> through the nodes and start gmond back up on each node. As each node
>> starts, it broadcasts its number of cpus throughout the cluster. Thus,
>> when I'm done restarting, one of the nodes (the first to restart)
>> knows how many cpus each node has, but nodes that were restarted last
>> don't have complete state information.
>>
>
> Not exactly true, see below.
>
>
>> When I then restart 'gmetad' at the central location, it connects up
>> to one of the nodes in the cluster, and if that node doesn't have full
>> state information, gmetad incorrectly reports the number of cpus in the
>> cluster. [Since I am using a background process that gathers metrics
>> separately from 'gmond' relatively frequently, this background process
>> is probably causing all nodes in the cluster to know about all of the
>> hosts in the cluster if not all of the metrics of all of the hosts in
>> the cluster.]
>> This will eventually correct itself since all metrics are
>> periodically rebroadcast.
>> Possible alternate fixes may include:
>> (1) When a node receives a broadcast from another node that it
>> hasn't seen before, it may want to send its data back to the first
>> node. If I start node A and it broadcasts to an empty cluster, then I
>> start node B and it broadcasts to A, then it might be nice if node A
>> sends data back to B because it can reasonably infer that B doesn't
>> have A's state and that B should have A's state.
>>
>
> I haven't checked the gmond sources lately, but this is exactly what it
> was designed to do. Anytime gmond sees data from a new node that it
> hasn't seen before, it assumes that the new node doesn't know anything
> about it either, and sends a complete set of its own metrics out on the
> multicast address. This can actually cause part of the problem,
> especially if you restart gmond on a lot of nodes all at the same time,
> basically because multicast is udp based and therefore does not have
> guaranteed packet delivery. I think during this burst of udp metrics
> from many nodes, some get lost and you will just have to wait till they
> are resent later.
>
>
>> (2) maybe daemons that gather metrics should not directly
>> broadcast them throughout a cluster. Instead the metrics should be
>> accumulated within a central daemon and then be broadcast. (In other
>> words, treat 'gmond' as having two separate components: a metrics
>> gathering component and a metric/cluster aggregation component. Then
>> both the metrics component of 'gmond' and the metrics that I am
>> gathering should be handed to the aggregation component.) [This is
>> probably not useful without also implementing (1) above.]
>> (3) Alex implies that there may be alternate ways to
>> configure a cluster without using multicasting which may handle some
>> or all aspects of this problem.
>>
>
> You can configure gmond to use unicast if you don't need or care about
> the HA feature that multicast gives you.
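> A minimal unicast sketch (the collector name is just a placeholder):
> every node gets
>
>   udp_send_channel {
>     host = collector.example.com
>     port = 8649
>   }
>
> and the collector keeps a udp_recv_channel and a tcp_accept_channel on
> the same port for gmetad to poll.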
>
>
>> [We can treat each node as maintaining a list of metrics and
>> their current values and broadcasting deltas to that list on a
>> periodic basis. In the current system, it is possible to receive a
>> delta without having the background data to which the delta applies.
>> Multiple daemons each spitting out deltas to their own metrics is
>> compatible with the current model. However, we may want to have all
>> the background data in a single list; we may also want each node to
>> know which metric gathering daemons exist so that we can better report
>> when one of the metric gathering daemons dies.]
>>
>> Moving on to the issue of correcting configuration problems. While we
>> can say that having a timeout is the way to correct configuration
>> issues, this is not necessarily the best implementation. Part of my
>> problem is that I have multiple daemons that gather and broadcast
>> metrics. If we address parts of that as discussed above, then it
>> becomes easier to fix the broadcast address by just resetting a single
>> daemon.
>>
>
> There was a plan to provide a plugin architecture for writing custom
> metrics in ganglia, but I am not sure what happened to that, though.
>
>
>> So, at the current time, we can configure the system in a couple
>> of ways. We can configure the system so that a host is considered
>> removed from a cluster when the host has been down sufficiently long,
>> or we can manually remove the host from the cluster by restarting all
>> gmond daemons in the cluster.
>> Possible alternate approaches might include providing a command
>> that could be sent to a 'gmond' daemon in a cluster to remove a host
>> from the cluster. It may be that there already exist mechanisms to
>> restart all gmond daemons in a cluster, but this mechanism is not
>> integrated into ganglia.
>>
>> So, thanks, I think I now understand what's going on.
>>
>> Cheers, Chuck
>>
>>
>>
>> Alex Balk wrote:
>>
>>> Hi Chuck,
>>>
>>>
>>> See below...
>>>
>>>
>>>
>>> Chuck Simmons wrote:
>>>
>>>
>>>
>>>> The number of cpus does get sorted out, but I don't believe that
>>>> restarting 'gmond' is a solution. The problem occurs after restarting
>>>> a number of 'gmond' processes, and the problem is caused because
>>>> 'gmond' is not reporting the information. Does 'gmond' maintain a
>>>> timestamp on disk as to when it last reported the number of cpus and
>>>> insist on waiting sufficiently long to report again? Does the
>>>> collective distributed memory of the system remember when the number
>>>> of cpus was last reported but not remember what the last reported
>>>> value was? Is there any chance that anyone can give me hints to how
>>>> the code works without me having to read the code and reverse engineer
>>>> the intent?
>>>>
>>>>
>>>>
>>> The reporting interval for number of CPUs is defined within /etc/gmond.conf.
>>> For example:
>>>
>>> collection_group {
>>>   collect_once = yes
>>>   time_threshold = 1800
>>>   metric {
>>>     name = "cpu_num"
>>>   }
>>> }
>>>
>>> The above defines that the number of CPUs is collected once at the
>>> startup of gmond and reported every 1800 seconds.
>>> Your problem occurs because gmond doesn't save any data on disk, but
>>> rather keeps it in memory. This means that if you're using a single
>>> gmond aggregator (in unicast mode) and that aggregator gets restarted,
>>> it will not receive another report of the number of CPUs until 1800
>>> seconds have elapsed since the previous report.
>>> The case of multicast is a more interesting one, since every node holds
>>> data for all nodes on the multicast channel. The question here is
>>> whether an update with a newer timestamp overrides all previous XML data
>>> for the host. I don't think that's the case; it seems more likely that
>>> only existing data is overwritten... but then, I don't use multicast, so
>>> you may qualify this answer as throwing useless, obvious crap your way.
>>>
>>> Generally speaking, there are 2 cases when a host reports a metric via
>>> its send_channel:
>>> 1. When a time_threshold expires.
>>> 2. When a value_threshold is exceeded.
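>>> For example, a load metric typically has both; the numbers below are
>>> only illustrative, not a recommendation:
>>>
>>> collection_group {
>>>   collect_every = 20
>>>   time_threshold = 90
>>>   metric {
>>>     name = "load_one"
>>>     value_threshold = "1.0"
>>>   }
>>> }
>>>
>>> i.e. load_one is sent at least every 90 seconds, and sooner if the
>>> value changes by more than 1.0 since it was last sent.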
>>>
>>> You're welcome to read the code for more insight, but a simple telnet to
>>> a predefined TCP channel would probably be quicker. You could just look
>>> at the XML data and compare pre-update and post-update values (yes,
>>> you'll need to take note of the timestamps - again, in the XML).
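>>> Something as simple as
>>>
>>>   telnet <node> 8649
>>>
>>> (or whatever port your tcp_accept_channel is on) dumps the full XML,
>>> which you can then grep for cpu_num and its TN/TMAX/DMAX attributes.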
>>>
>>>
>>>
>>>> I understand that I can group nodes via /etc/gmond.conf. The question
>>>> is, once I have screwed up the configuration, how do I recover from
>>>> that screw up? I have restarted various gmetad's and various
>>>> gmond's. The grouping is still incorrect. Exactly which gmetad's and
>>>> gmond's do I have to shut down when. And, again, my real question is
>>>> about understanding how the code works -- how the distributed memory
>>>> works.
>>>>
>>>>
>>>>
>>> As far as I know, you cannot recover from a configuration error unless
>>> you've made sure host_dmax was set to a fairly small, non-zero value.
>>>
>>> From the docs:
>>>
>>> The host_dmax value is an integer with units in seconds. When set to
>>> zero (0), gmond will never delete a host from its list even when a
>>> remote host has stopped responding. If host_dmax is set to a positive
>>> number then gmond will flush a host after it has not heard from it for
>>> host_dmax seconds. By the way, dmax means ``delete max''.
>>>
>>> This way, once a host's configuration has been modified to point at a
>>> different send channel, the aggregator(s) on its previous channel will
>>> forget about its existence once host_dmax expires.
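>>> In gmond.conf that's a one-liner in the globals section; the value
>>> below is just an example:
>>>
>>> globals {
>>>   host_dmax = 3600
>>> }
>>>
>>> i.e. a host that has been silent for an hour is dropped from the XML.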
>>>
>>> Personally, I don't use multicast for various reasons, the main one
>>> actually being its main advantage - every node keeps data on the entire
>>> cluster. While this provides for maximal high availability, it also has
>>> a bigger memory footprint, especially when you have a few thousand
>>> nodes.
>>>
>>>
>>>
>>>> I'd much rather be ignored than have people try to pawn off facile
>>>> answers on me.
>>>>
>>>>
>>>>
>>> I'd provide you with more information on a possible setup which balances
>>> high availability with performance, but I wouldn't want to overflow you
>>> with useless data any more than I've done so far.
>>> Let me know if you'd like more information.
>>>
>>> Cheers,
>>> Alex
>>>
>>>
>>>
>>>> Cheers, Chuck
>>>>
>>>>
>>>>
>>>> Bernard Li wrote:
>>>>
>>>>
>>>>> Hi Chuck:
>>>>>
>>>>> For the first issue - give it time, it should sort itself out.
>>>>> Alternatively, you can find out which node is reporting incorrect
>>>>> information, and restart gmond on it.
>>>>>
>>>>> For the second issue, you can group nodes into different data_source
>>>>> entries via the multicast port in /etc/gmond.conf. Use the same port #
>>>>> for nodes you want to belong to the same group.
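>>>>> For example (names and ports below are made up), group A's nodes
>>>>> could all use port 8649 and group B's nodes port 8650 in their
>>>>> gmond.conf, with gmetad.conf listing them as separate sources:
>>>>>
>>>>>   data_source "groupA" nodeA1:8649 nodeA2:8649
>>>>>   data_source "groupB" nodeB1:8650 nodeB2:8650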
>>>>>
>>>>> You'll need to restart gmetad and gmond for the new groupings to take
>>>>> effect.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Bernard
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* [EMAIL PROTECTED] on behalf of
>>>>> Chuck Simmons
>>>>> *Sent:* Wed 22/03/2006 17:54
>>>>> *To:* [email protected]
>>>>> *Subject:* [Ganglia-developers] reorganizing clusters
>>>>>
>>>>> I need help understanding two things.
>>>>>
>>>>> I currently have a grid. One of the clusters in the grid is named
>>>>> "staiu" and the "grid" level web page reports that this has 8 hosts
>>>>> containing 4 cpus. In actuality, this has 8 hosts each containing 4
>>>>> cpus, but apparently the hosts are not reporting the current number of
>>>>> cpus to the front end. Why not? I recently restarted gmond on each of
>>>>> the 8 hosts.
>>>>>
>>>>> Another cluster is named "staqp05-08" and the "grid" level web page
>>>>> reports that this has 12 hosts. In actual fact, it only has 4 hosts.
>>>>> The extra 8 hosts are the 8 hosts of the 'staiu' cluster. On the
>>>>> cluster level page for staqp05-08, the "choose a node" pull down menu
>>>>> lists the 8 staiu hosts, and the "hosts up" number contains the staiu
>>>>> hosts, and there are undrawn graphs for each of the staiu hosts in the
>>>>> "load one" section. What do I have to do so that the web pages or gmond
>>>>> daemons or whatever won't think that the staqp cluster contains the
>>>>> staiu hosts? What is the specific mechanism that causes this
>>>>> association to persist despite having shut down all staqp gmond daemons
>>>>> and both the gmond and gmetad daemons on the web server, simultaneously,
>>>>> and then starting up that collection of daemons?
>>>>>
>>>>> Thanks, Chuck
>>>>>