Re: [Ganglia-general] gmetad bug when gmond host hangs

Carlo Marcelo Arenas Belon Mon, 01 Sep 2008 14:05:47 -0700

On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote:
> >>> On 8/30/2008 at 12:25 AM, in message <[EMAIL PROTECTED]>, Carlo
> Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> > On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
> >> Should this have made it into 3.1, or 3.1.1?  It
> >> doesn't look like it.
> > 
> > There is a fix in trunk now with r1738 and unless something goes wrong with
> > it, will be most likely released with 3.1.2 and 3.0.8.


The proposed backport patch for 3.1.x has been updated in the bug report and
officially requested for inclusion in 3.1. Beware that it includes one extra
unrelated change, which is needed to prevent future backporting conflicts but
is otherwise mostly irrelevant (especially if you are making your own
package). It also includes additional changes that simplify the logic and
avoid a possible logic failure that could result in gmetad crashing, so using
this newer version is encouraged:

  
http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165&action=view

> > 3.1.1 is already in testing and since this bug is not a showstopper for that
> > specific release, I'd be surprised if the release manager decides it should
> > be backported to it, but that shouldn't prevent you patching your own 
> > package with the proposed patch if you don't want to wait.
> 
> If we are confident enough that the patch for this bug solves the problem

I am fairly confident that the patch resolves the problem reported in BUG92,
which matches the description of the problem from Cos and is easy to
replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have
proposed it here, in bugzilla and in the STATUS file.
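
To illustrate why a SIGSTOP is enough to reproduce this: a stopped process
keeps its listening socket, so the kernel still completes TCP handshakes on
its behalf, but nothing ever writes the XML. The following is a minimal
Python sketch of that effect, not a test against gmond itself. It assumes a
POSIX system; the stand-in server, the probe() helper and the half-second
timeout are all hypothetical.

```python
import os
import signal
import socket


def probe(port, timeout=0.5):
    """Connect to localhost:port and wait for data; report what happened."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            # The timeout passed above also applies to this recv().
            return "ok" if s.recv(4096) else "closed"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def demo():
    """Return probe results before and after the stand-in server is stopped."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # let the kernel pick a free port
    listener.listen(5)
    port = listener.getsockname()[1]

    pid = os.fork()
    if pid == 0:                          # child: hypothetical stand-in for gmond
        try:
            while True:
                conn, _ = listener.accept()
                conn.sendall(b"<GANGLIA_XML/>\n")
                conn.close()
        finally:
            os._exit(0)

    listener.close()                      # parent only needs the port number
    try:
        before = probe(port)
        os.kill(pid, signal.SIGSTOP)
        os.waitpid(pid, os.WUNTRACED)     # wait until the child is really stopped
        after = probe(port)               # handshake succeeds, data never arrives
    finally:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return before, after


if __name__ == "__main__":
    print(demo())
```

A reader without a timeout would simply block on the second probe, which is
the kind of hang this thread is about.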

> then I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and
> restarting the testing period.  The whole point of the testing period is
> to flush out problems like this and then determine if the fix is important
> enough to retag and retest.

Agreed, but doing so will delay the next release of 3.1 and, indirectly, the
next release of 3.0 as well (I won't start on that until 3.1 is out, to avoid
confusion and overstressing our limited testing resources).

3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical
configurations) and a fix for a bug rated "high" (instability with
tcpconn.py). Both had already been backported in the Fedora and Gentoo
distribution packages, but not in Debian or in our own packages, and they are
not available to anyone using 3.1 from sources (which includes all other
architectures where no packages are available yet). Those users are
therefore waiting for a 3.1.1 release.

> We need some feedback on how serious this problem is

For 3.0, hitting it requires that you also have another problem, such as:

* gmond hangs, either because of another gmond bug or because gmond is
  misconfigured, with too many or too slow consumers for the number of
  tcp_accept_channel sections defined and their configured timeout values
  (which would most likely affect all the sources as well)
* the system hangs (a kernel-level deadlock, some security/network guy
  playing games on the cluster, or bad hardware)
* a gmond becomes unresponsive because it is collecting data from too many
  nodes or is getting swapped out (it should be moved to another system or
  the cluster segmented further)

3.1 has the same sources of problems, except that a misconfigured gmond is
more likely to fail because of the increase in the XML payload, but only if
ganglia is misconfigured. Your gmond should be queried over the local LAN by
a local gmetad when possible, with hierarchical gmetads used to extend the
setup over WANs. If it absolutely must be queried over a WAN, a timeout
should be configured for the tcp_accept_channel; a value of -1 means no
timeout and is sometimes recommended for extreme cases.

> how confident we are in the fix.

The fix was designed as the minimally intrusive change that accomplishes the
desired objective without reverting to the pre-BUG27 scenario. I expect it to
evolve further in the near future, as there are still open issues around that
code that will need to be addressed. This gets a little technical and is
better suited to the ganglia-developers list, but it is included here to
complete the explanation; future discussion should be moved to the
ganglia-developers list. The open issues are:

* it is not implemented for gmetad-python yet
* it is not consistently implemented: not all failures skip a source; some
  instead default to scanning all sources, which is the problem that BUG27
  was meant to fix, as that could generate dips.
* it overloads the "last_good_index" identifier even though that doesn't
  really match its original intention, as we haven't yet confirmed that the
  source is working.
* I suspect the "dead" identifier is misused (which would explain the
  never-ending failure messages that gmetad emits) and therefore some
  refactoring around that code might be needed (which could include adding
  some other features)
* I suspect the initialization of last_good_index might be problematic, but
  it seems to be working now (probably because the whole structure is zeroed
  before it is used, or just out of pure luck and the compiler doing the work
  for us).
* there are also BUG38 and BUG208, which are linked to this section of the
  code and are expected to have fixes developed sometime soon.
* it relates somehow to the wishlist item of supporting either compressed
  payloads in gmond or incremental updates, for which no code or design has
  been produced yet (compressed payloads for gmetad or incremental updates
  will most likely come first anyway)

It is also important to note that the fix changes the currently expected
behaviour: timeouts or poll errors didn't trigger a failover before, and now
they will. As a result, misconfigured setups that just happened to work
before (except that they sometimes got some dips) will now fail if the other
sources are not configured correctly, which didn't matter before as they were
never used anyway.
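
The behaviour change can be pictured with a small Python sketch. This is not
gmetad's actual code (which is C): poll_source(), fake_fetch() and the host
names are hypothetical, and only the name last_good_index is borrowed from
the discussion above. It assumes a data source is an ordered list of
redundant hosts, and that any timeout or socket error on one host makes the
poller move on to the next one.

```python
from typing import Callable, List, Optional, Tuple


def poll_source(
    hosts: List[str],
    last_good_index: int,
    fetch: Callable[[str], str],
) -> Tuple[int, Optional[str]]:
    """Try each host once, starting with the one that worked last time.

    Returns (index_of_host_that_answered, payload), or
    (last_good_index, None) when every host failed.
    """
    for offset in range(len(hosts)):
        index = (last_good_index + offset) % len(hosts)
        try:
            payload = fetch(hosts[index])
        except OSError:
            # Covers refused connections as well as timeouts
            # (TimeoutError is an OSError subclass).
            continue
        return index, payload
    return last_good_index, None


def fake_fetch(host: str) -> str:
    """Stand-in for a TCP read: the first host hangs, the second answers."""
    if host == "node1":
        raise TimeoutError("no data before the deadline")
    return "<GANGLIA_XML/>"


if __name__ == "__main__":
    print(poll_source(["node1", "node2"], 0, fake_fetch))
```

In this picture a misconfigured second host is no longer harmless, because
it will actually be used.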

Scenarios where I see this happening are:

* the sources have different clocks, so after failover rrd complains loudly
  about duplicated updates or data gets reported at the wrong time.
* the other source selected is deaf, so all data from the cluster is lost

Fixing those scenarios will also require fixing BUG38 and BUG208.
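
Until then, a quick sanity check is to query every host listed for a source
and compare what comes back. The Python sketch below only does the comparison
part. It assumes the gmond XML carries a LOCALTIME attribute (seconds since
the epoch) on its CLUSTER element and that a deaf gmond answers with a
CLUSTER containing no HOST elements; the helper names, the sample documents
and the 30 second tolerance are made up for the example, and the replies are
assumed to have been fetched at roughly the same moment.

```python
import xml.etree.ElementTree as ET


def summarize(xml_text):
    """Return (LOCALTIME or None, number of HOST elements) for one reply."""
    cluster = ET.fromstring(xml_text).find("CLUSTER")
    if cluster is None:
        return None, 0
    stamp = cluster.get("LOCALTIME")
    host_count = len(cluster.findall("HOST"))
    return (int(stamp) if stamp is not None else None), host_count


def check_sources(replies, tolerance=30):
    """Return a list of problems found in {host: xml_text} replies."""
    problems = []
    times = {}
    for host, xml_text in replies.items():
        stamp, host_count = summarize(xml_text)
        if host_count == 0:
            problems.append(f"{host}: reports no hosts (deaf?)")
        if stamp is not None:
            times[host] = stamp
    if times and max(times.values()) - min(times.values()) > tolerance:
        problems.append(f"clock skew above {tolerance}s")
    return problems


# Hypothetical, heavily trimmed replies used only for illustration.
GOOD = ('<GANGLIA_XML><CLUSTER NAME="c" LOCALTIME="1220302000">'
        '<HOST NAME="n1"/></CLUSTER></GANGLIA_XML>')
SKEWED = ('<GANGLIA_XML><CLUSTER NAME="c" LOCALTIME="1220305600">'
          '<HOST NAME="n1"/></CLUSTER></GANGLIA_XML>')
DEAF = ('<GANGLIA_XML><CLUSTER NAME="c" LOCALTIME="1220302005">'
        '</CLUSTER></GANGLIA_XML>')

if __name__ == "__main__":
    print(check_sources({"a": GOOD, "b": SKEWED}))
    print(check_sources({"a": GOOD, "b": DEAF}))
```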

> I would rather throw away 3.1.1 now in favor of a fixed 3.1.2 than have
> to do this all over again next month.

Your call, but if you decide to wait for 3.1.2 then you are going to have to
wait 2 months, as next month will be 3.0.8's turn.

Carlo

