On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote:
> >>> On 8/30/2008 at 12:25 AM, in message <[EMAIL PROTECTED]>, Carlo
> Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> > On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
> >> Should this have made it into 3.1, or 3.1.1? It
> >> doesn't look like it.
> >
> > There is a fix in trunk now with r1738 and unless something goes wrong with
> > it, will be most likely released with 3.1.2 and 3.0.8.

The proposed backport patch for 3.1.x has been updated in the BUG and
officially requested for inclusion in 3.1. Beware that it includes one extra
unrelated change, which is needed to prevent future conflicts when
backporting but is otherwise mostly irrelevant (especially if you are making
your own package). It also includes additional changes that simplify the
logic and avoid a possible logic failure that could result in gmetad
crashing, so using this newer version is encouraged:

http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165&action=view

> > 3.1.1 is already in testing and since this bug is not a showstopper for that
> > specific release, I'd be surprised if the release manager decides it should
> > be backported to it, but that shouldn't prevent you patching your own
> > package with the proposed patch if you don't want to wait.
>
> If we are confident enough that the patch for this bug solves the problem

I am fairly confident that the patch resolves the problem reported in BUG92,
which matches the description of the problem from Cos and is easy to
replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have
proposed it here, in bugzilla and in the STATUS file.

> then I have no problem scraping 3.1.1, tagging and rolling 3.1.2 and
> restarting the testing period. The whole point of the testing period is
> to flush out problems like this and then determine if the fix is important
> enough to retag and retest.

Agreed, but doing so will delay releasing the next version of 3.1 and also,
indirectly, the next release for 3.0 (I won't start that until 3.1 is out, to
avoid confusion and overstressing our limited testing resources).

3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical
configuration) and a fix for a bug rated "high" (instability with
tcpconn.py). Both have already been backported in the Fedora and Gentoo
distribution packages, but not in Debian, in our own packages, or for anyone
using 3.1 from sources (all other architectures where no packages are
available yet), who are therefore waiting for a 3.1.1 release.

> We need some feedback on how serious this problem is

For 3.0, hitting it requires that you also have another problem, such as:

* gmond hangs (either because of another gmond bug or because of a
  misconfiguration of gmond that has too many or too slow consumers for the
  number of tcp_accept_channel defined and their configured timeout values,
  which should most likely affect all the sources as well)
* the system hangs (a kernel level deadlock, some security/network guy
  playing games on the cluster, or bad hardware)
* a gmond becomes unresponsive because it is collecting data from too many
  nodes or getting swapped out (it should be moved to another system or
  segmented further)

3.1 has the same sources of problems, except that a misconfigured gmond is
more likely to fail because of the increase in the XML payload, but only if
ganglia is misconfigured. Your gmond should be queried over the local LAN by
a local gmetad when possible, with hierarchical gmetads used to extend the
setup over WANs. If it absolutely must be queried over a WAN, a timeout
should be configured for the tcp_accept_channel; -1 means no timeout and is
sometimes recommended for extreme cases.

> how confident we are in the fix.

The fix was designed as the minimally intrusive change that will accomplish
the desired objective without reverting to the pre-BUG27 scenario, and I
expect it to evolve further in the near future, as there are still open
issues around that code that will need to be addressed. (This gets a little
technical and would be better suited to the ganglia-developers list, but it
is included here to complete the explanation in this context; future
discussion should move to the ganglia-developers list.)

* it is not implemented for gmetad-python yet
* it is not consistently implemented: not all failures skip a source; some
  instead default to scanning all sources, which was the problem that BUG27
  was meant to fix, as that could generate dips.
* it overloads the "last_good_index" identifier even though that doesn't
  really match its original intention, as we haven't yet confirmed that the
  source is working.
* I suspect the "dead" identifier is misused (which would explain the
  never-ending failure messages that gmetad produces), and therefore some
  refactoring around that code might be needed (which could include adding
  some other features)
* I suspect the initialization of last_good_index might be problematic, but
  it seems to be working now (probably because the whole structure is zeroed
  before it is used, or just out of pure luck with the compiler doing the
  work for us).
* there are also BUG38 and BUG208, which are linked to this section of the
  code and are expected to have fixes developed sometime soon.
* it relates somehow to the wishlist item of supporting either compressed
  payloads in gmond or incremental updates, for which no code or design has
  been produced yet (compressed payloads for gmetad or incremental updates
  will most likely come first anyway)

It is important to note as well that the fix changes the currently expected
behaviour (timeouts or poll errors didn't trigger a failover and now they
will), so misconfigured setups that just happened to work before (except that
they sometimes got some dips) will fail now if the other sources are not
configured correctly (which wasn't needed before, as they were never used
anyway). Scenarios where I see this happening are:

* the sources have different system times, so after failover rrd complains
  loudly about duplicated updates, or data gets reported at the wrong time.
* the other source selected is deaf, so all data from the cluster is lost

Fixing those scenarios will also require fixing BUG38 and BUG208.

> I would rather throw away 3.1.1 now in favor of a fixed 3.1.2 than half
> to do this all over again next month.

Your call, but if you decide to wait for 3.1.2 then you are going to have to
wait 2 months, as next month will be 3.0.8's turn.

Carlo
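
The behaviour change described in this message (timeouts or poll errors now
trigger a failover to the next configured source) can be illustrated with a
rough sketch. This is not gmetad's actual code, which is written in C: the
names poll_with_failover, PollError and fetch are hypothetical, and only
last_good_index is borrowed from the discussion above.

```python
from typing import Callable, Optional, Sequence, Tuple


class PollError(Exception):
    """Hypothetical stand-in for a timeout or poll error on one source."""


def poll_with_failover(
    sources: Sequence[str],
    last_good_index: int,
    fetch: Callable[[str], str],
) -> Tuple[Optional[str], int]:
    """Try each source at most once, starting at last_good_index.

    Returns (payload, index_that_answered), or (None, last_good_index)
    if every source failed during this polling round.
    """
    count = len(sources)
    for offset in range(count):
        index = (last_good_index + offset) % count
        try:
            payload = fetch(sources[index])
        except PollError:
            # Assumed post-fix behaviour: a timeout or poll error moves on
            # to the next configured source instead of staying on this one.
            continue
        return payload, index
    return None, last_good_index
```

Under these assumptions, a setup whose secondary sources were never exercised
before would now start being queried as soon as the first source times out,
which is why their clocks and configuration suddenly matter.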

