On Thu, Sep 27, 2007 at 01:12:23AM -0600, Natraj Muthukrishnan wrote:
> 1) What if the node which we have specified in the gmetad as the cluster 
> address goes down?

Then gmetad fails to pull the data from that gmond and tries the next
collector configured for that cluster, if one was listed in the
configuration.

> Ideally gmetad should be contacting the other nodes within this cluster. But
> this does not seem to happen.

gmetad doesn't know which nodes are the collectors for that cluster; that is
what the data_source configuration is meant to tell it.
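As a sketch (the hostnames are hypothetical), a data_source entry in
gmetad.conf that lists more than one collector is what gives gmetad a
fallback to try when the first host is unreachable:

```
# gmetad.conf: poll "my cluster" every 15 seconds, trying the
# collectors in the order listed until one answers
data_source "my cluster" 15 collector1.example.com:8649 collector2.example.com:8649
```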

Even assuming that gmetad has, from the last time it polled the collector, a
historical list of all nodes in the cluster, it doesn't know which of them
would have the full cluster view it needs. With multicast, every node keeps a
copy of all the cluster metrics; but if you are using unicast, only some
gmonds have all the information for the cluster collected, and there is no
way for gmetad to know which ones those are, except for the configuration
that instructs it to look at some of them specifically for that information.

> It shows the entire cluster to be down. Even though the other node within 
> that cluster is up and running. The architecture document says we have to 
> specify multiple addresses in the gmetad.conf cluster section for failover. 

Or you can put a load balanced VIP in front of as many collectors as you need
for redundancy (in different switches, power circuits, racks, or whatever it
is that you are trying to protect against).

> Now I have a use case for this. What if we have say 1000 nodes in one cluster 
> and then we want perfect failover for this cluster? Do we need to add all the 
> node address in the gmetad.conf file cluster section for gmetad to contact 
> that node in case the main cluster node is down?

Hopefully you only have 2 collectors; if you have more, you probably want to
be a little more selective about how many you add to the gmetad
configuration, since polling each one of them takes time and going through a
list of 1000 nodes linearly is probably not the best approach possible.

Also (although it might be obvious), you have to consider the cost/risk of
losing the collector, which, assuming there is no other backup collector you
can use, will only prevent updates from reaching gmetad until it is restored.

Since you ask for perfection, you probably want a pair of load balancers
connected to redundant switches, each with a VIP that connects to as many
collectors as needed depending on how many racks, circuits, PDUs, switches,
and datacenters your cluster spans, and 2 entries in gmetad pointing to those
2 VIPs.
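A minimal sketch of the gmetad side of that setup (the VIP hostnames are
hypothetical; the load balancers behind them are configured separately):

```
# gmetad.conf: one data_source naming both VIPs, so gmetad fails over
# from one load balancer to the other if the first stops answering
data_source "big cluster" 15 vip-a.example.com:8649 vip-b.example.com:8649
```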

> 2) The next thing comes across as a bug to me. Pls let me know if my 
> observation is correct on this one.

Most likely incorrect; otherwise it would have been filed as a bug already and
most likely fixed by now.

> When we specify two machine ips in the cluster tab in the gmetad.conf file , 
> in the browser when we go to that cluster we see 2 machines . One with the ip 
> address as the name and the other with hostname of the second machine in the 
> list.

That sounds like confusion created by the fact that you are using IPs instead
of hostnames, and different hosts disagree on what the IP/hostname
correlation should be.

That wouldn't be a problem, of course, if your setup is done correctly and
you have a working DNS that correctly maps the forward as well as the reverse
zone for your cluster and is reachable from all nodes.

Carlo

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general