On Mon, 21 Dec 2009, Spike Spiegel wrote:
>> a. Get all the rrds (rsync) from gmetad2 before you restart gmetad1
> which unless you have a small amount of data or a fast network between
> the two nodes won't complete before the next write is initiated,
> meaning they won't be identical.

Granted, they will never be identical. Even on the fastest networks there will be a window of lost data. On fast networks with a smaller number of nodes it will be small; on bigger networks, a larger window, and so on.
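For reference, a minimal sketch of that sync step (step a). It assumes gmetad's default rrd_rootdir of /var/lib/ganglia/rrds and working ssh/rsync access from gmetad1 to gmetad2; the hostname and path are illustrative, so adjust to your layout:

    import subprocess

    # Illustrative values: adjust to your layout.
    RRD_DIR = "/var/lib/ganglia/rrds/"  # gmetad's default rrd_rootdir
    SOURCE = "gmetad2:" + RRD_DIR       # the node that kept writing

    # Pull the whole RRD tree from gmetad2 before starting gmetad1,
    # so gmetad1 resumes with the most complete data set available.
    subprocess.check_call(["rsync", "-a", "--delete", SOURCE, RRD_DIR])

Anything gmetad2 writes between this copy and the restart of gmetad1 falls into the unavoidable window described above.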
> how do you tell which one has most up to date data?

This is in no respect an automatic process (even though, if I really wanted to, I could automate it). Point the proxy at your primary node; if it fails, point it at the secondary or tertiary (a sketch of that fallback is at the end of this mail).

> if you really mean "most recent" then both would, because both would
> have fetched the last reading assuming they are both functional, but
> gmetad1 would have a hole in its graphs. To me that does not really
> count as up to date. Up to date would be the one with the most
> complete data set which you have no way to identify programmatically.
>
> Also, assume now gmetad2 fails and both have holes, which one is the
> most up to date?

That is up to you to decide. This is in no way perfect.

> I guess it does if I look at it from your perspective which if I
> understood it correctly implies that:
> * some data loss doesn't matter
> * manual interaction to fix things is ok
>
> But that isn't my perspective. Scalable (distributed) applications
> should be able to guarantee by design no data loss in as many cases
> as possible and not force you to centralized designs or hackery in
> order to do so.
>
> There are ways to make this possible without changes to the current
> gmetad code by adding a helper webservice that proxies the access to
> rrd. This way it's perfectly fine to have different locations with
> different data and the webservice will take care of interrogating one
> or more gmetads/backends to retrieve the full set and present it to
> the user. Fully distributed, no data loss. This could of course be
> built into gmetad by making something like port 8652 access the rrds,
> but to me that's the wrong path, makes gmetad's code more complicated
> and it's potentially a functionality that has nothing to do with
> ganglia and is backend dependent.

The issue is the value of this data. If these were financial transactions, then no loss would be acceptable; however, these are not. They are performance and trending data, which get "averaged" down as time goes by, so losing a couple of hours or even days of data is not tragic. I have also seen many projects where we tried to avoid a particular "edge" case and in the process introduced a whole lot of new issues that were worse than the problem we started with. To this point, I have run removespikes.pl on RRDs numerous times to remove spikes in Ganglia data, and while in most cases it has worked, in a couple of cases it ended up corrupting the RRD files so that they could no longer be used by gmetad. Therefore I can reasonably foresee something like that happening in your implementation. I have also seen bugs in the past (I remember a multicast bug we reported years ago) go unaddressed due to what I can only interpret as a lack of resources. So if you weigh all the possibilities of things going wrong (and a lot can) against the resources available, I'd say you are asking for trouble.
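To make the proxy fallback concrete, here is a minimal sketch. Gmetad's interactive port 8652 comes from this thread; the host names, timeout, and query path are assumptions:

    import socket

    GMETADS = ["gmetad1", "gmetad2", "gmetad3"]  # primary first
    PORT = 8652                                  # gmetad interactive_port

    def fetch_xml(path="/"):
        """Return XML from the first gmetad that answers the query."""
        for host in GMETADS:
            try:
                with socket.create_connection((host, PORT), timeout=5) as sock:
                    # The interactive port takes a query path followed
                    # by a newline and replies with XML, then closes.
                    sock.sendall((path + "\n").encode())
                    chunks = []
                    while True:
                        data = sock.recv(65536)
                        if not data:
                            break
                        chunks.append(data)
                return b"".join(chunks).decode()
            except OSError:
                continue  # node down or unreachable: try the next one
        raise RuntimeError("no gmetad reachable")

The merging webservice you describe would instead query every reachable backend and combine the results rather than stopping at the first answer, but the simple fallback above is all the scheme I described needs.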
Vladimir
