On Mon, 21 Dec 2009, Spike Spiegel wrote:

>> a. Get all the rrds (rsync) from gmetad2 before you restart gmetad1
> which, unless you have a small amount of data or a fast network between
> the two nodes, won't complete before the next write is initiated, meaning
> they won't be identical.


Granted, they will never be identical. Even on the fastest networks there 
will be a window of lost data. On fast networks with a smaller number of 
nodes the window will be small; on bigger networks it will be larger, and 
so on.
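
For the record, when I say rsync the rrds I mean nothing fancier than the 
sketch below. It is a minimal sketch, assuming gmetad2 is reachable over 
ssh and that both nodes keep their RRDs under /var/lib/ganglia/rrds; the 
host and paths are my assumptions, adjust them to your layout.

#!/usr/bin/env python
import subprocess
import sys

# Assumed source host and RRD directories; change to match your setup.
SRC = "gmetad2:/var/lib/ganglia/rrds/"
DST = "/var/lib/ganglia/rrds/"

def sync_rrds():
    # -a preserves timestamps and permissions, --delete mirrors removals.
    # Run this while gmetad1 is still stopped to keep the loss window small.
    return subprocess.call(["rsync", "-a", "--delete", SRC, DST])

if __name__ == "__main__":
    sys.exit(sync_rrds())

Run it (or the equivalent one-line rsync) right before starting gmetad1, 
and the hole in its graphs is only as wide as the transfer takes.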


> how do you tell which one has the most up to date data?


This is in no respect an automatic process (even though, if I really 
wanted to, I could automate it). Point your proxy to the primary node; if 
it fails, point it to the secondary or tertiary.
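
To be concrete, "point to secondary or tertiary" can be as dumb as probing 
each gmetad in order and using the first one that answers. A rough sketch, 
assuming the default xml_port of 8651 and made-up hostnames:

import socket

# Assumed hostnames; 8651 is gmetad's default xml_port.
GMETADS = ["gmetad1.example.com", "gmetad2.example.com", "gmetad3.example.com"]

def pick_live_gmetad(port=8651, timeout=2.0):
    for host in GMETADS:
        try:
            sock = socket.create_connection((host, port), timeout)
            sock.close()
            return host        # first gmetad that accepts a connection wins
        except socket.error:
            continue           # dead or unreachable, try the next one
    return None

Whichever host answers first is where the web frontend (or the proxy in 
front of it) gets pointed; the switch itself stays a manual or scripted 
decision.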


> if you really mean "most recent" then both would, because both would
> have fetched the last reading assuming they are both functional, but
> gmetad1 would have a hole in its graphs. To me that does not really
> count as up to date. Up to date would be the one with the most
> complete data set which you have no way to identify programmatically.
>
> Also, assume now gmetad2 fails and both have holes, which one is the
> most up to date?

That is up to you to decide. This is in no way perfect.


> I guess it does if I look at it from your perspective which if I
> understood it correctly implies that:
> * some data loss doesn't matter
> * manual interaction to fix things is ok
>
> But that isn't my perspective. Scalable (distributed) applications
> should be able to guarantee by design no data loss in as many cases as
> possible and not force you to centralized designs or hackery in order
> to do so.
>
> There are ways to make this possible without changes to the current
> gmetad code by adding a helper webservice that proxies the access to
> rrd. This way it's perfectly fine to have different locations with
> different data and the webservice will take care of interrogating one
> or more gmetads/backends to retrieve the full set and present it to
> the user. Fully distributed, no data loss. This could of course be
> built into gmetad by making something like port 8652 access the rrds,
> but to me that's the wrong path, makes gmetad's code more complicated
> and it's potentially a functionality that has nothing to do with
> ganglia and is backend dependent.
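
For reference, I read the webservice idea as roughly the sketch below: 
poll every gmetad's XML port, keep whichever copy of each host reports the 
freshest data, and serve the merged view. The hostnames, the 8651 port and 
the "freshest REPORTED wins" rule are my assumptions, not anything gmetad 
does today.

import socket
import xml.etree.ElementTree as ET

# Assumed backends; gmetad dumps its full XML tree on connect to xml_port.
GMETADS = [("gmetad1.example.com", 8651), ("gmetad2.example.com", 8651)]

def fetch_xml(host, port, timeout=5.0):
    sock = socket.create_connection((host, port), timeout)
    chunks = []
    while True:
        data = sock.recv(65536)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return ET.fromstring(b"".join(chunks))

def merged_hosts():
    best = {}  # host name -> (REPORTED timestamp, element)
    for host, port in GMETADS:
        try:
            tree = fetch_xml(host, port)
        except (socket.error, ET.ParseError):
            continue  # a dead backend simply drops out of the merge
        for h in tree.iter("HOST"):
            name = h.get("NAME")
            reported = int(h.get("REPORTED", "0"))
            if name not in best or reported > best[name][0]:
                best[name] = (reported, h)
    return [elem for _, elem in best.values()]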


The issue is the value of this data. If these were financial transactions 
then no loss would be acceptable; however, these are not. They are 
performance and trending data which get "averaged" down as time goes by, 
so losing a couple of hours or even days of data is not tragic.
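
Just to spell out what "averaged down" means: the RRDs gmetad writes only 
keep fine-grained samples for a short while and coarser averages for 
longer. The step and row counts below are illustrative, not Ganglia's 
actual defaults:

import subprocess

# One 15-second sample per step, then progressively coarser AVERAGE RRAs.
subprocess.call([
    "rrdtool", "create", "example.rrd",
    "--step", "15",
    "DS:sum:GAUGE:120:U:U",
    "RRA:AVERAGE:0.5:1:5760",     # about 1 day at 15-second resolution
    "RRA:AVERAGE:0.5:4:10080",    # about 1 week at 1-minute averages
    "RRA:AVERAGE:0.5:40:8640",    # about 60 days at 10-minute averages
    "RRA:AVERAGE:0.5:240:8760",   # about 1 year at 1-hour averages
])

A two-hour hole in the 15-second archive is annoying for a day; once the 
data has been rolled up into the hourly averages it barely shows.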

I have also seen many projects where we tried to avoid a particular "edge" 
case and in the process introduced a whole lot of new issues that were 
worse than the problem we started with. To illustrate this point, I have 
run removespikes.pl on RRDs numerous times to remove spikes in Ganglia 
data, and in most cases it has worked, yet in a couple of cases it ended 
up corrupting RRD files so that they couldn't be used by gmetad. Therefore 
I can reasonably foresee something like that happening in your 
implementation. Also, I have seen bugs in the past (I remember a multicast 
bug we reported years ago) going unaddressed due to what I can only 
interpret as a lack of resources.
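
For what it's worth, the cheapest guard I know of against that 
removespikes.pl failure mode is to ask rrdtool to parse every file and 
flag the ones it refuses, before pointing gmetad back at them. A rough 
sketch; the directory path is an assumption:

import os
import subprocess

RRD_DIR = "/var/lib/ganglia/rrds"   # assumed gmetad RRD directory

def broken_rrds(root=RRD_DIR):
    bad = []
    with open(os.devnull, "w") as devnull:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith(".rrd"):
                    continue
                path = os.path.join(dirpath, name)
                # "rrdtool info" exits non-zero if it cannot read the file.
                rc = subprocess.call(["rrdtool", "info", path],
                                     stdout=devnull, stderr=devnull)
                if rc != 0:
                    bad.append(path)
    return bad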

So if you weigh all the possibilities of things going wrong (and a lot 
can) against the resources available, I'd say you are asking for trouble.

Vladimir
