On Sat, Aug 30, 2008 at 11:00:40AM -0400, Ofer Inbar wrote:
> Carlo Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> > > | ------- Additional Comment #2 From Timothy Witham 2008-06-05 10:38
> > > | My patch still loses if we are talking to a gmond affected by Bug#38.
> > > | In that case, we receive incomplete data, but since it is some data,
> > > | we keep talking to that host every time.  Maybe we should just talk to
> > > | a random host every time.  Better to fix Bug#38 though...
> > >
> > > That would have the effect of gmetad
> > > getting incomplete, old data from that gmond, but that seems to be a
> > > different problem.
> > 
> > yes, and that is why it has a different bug.
> 
> What bug is that?

the part with the bug number was snipped out of this email, but it is
reintroduced below.

> I'm not talking about fixing gmond, I'm talking about having gmetad
> sense when gmond from one member of the cluster is giving old &
> incomplete data, and trying another cluster member to see if it can
> get newer data.  I didn't see a bug for that, I just saw a note in the
> "timeout poll" bug mentioned that a solution for it wouldn't handle
> that situation, and saying that's okay, that *should* be separate.

fair, and you are right, I probably should have put a comment there with a
reference to the "enhancement" bug as well, but since I put it in the email I
thought it wasn't needed, as most interested parties were hopefully reading
this thread or CC'd on the new bug.  It has been updated now anyway.

> Is there indeed a bug, or a plan by someone, to get gmetad to do this?

yes, basically the enhancement request to do some sort of intelligent load
balancing between sources, which is now being tracked in BUG208

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=208

> > > To solve it in gmetad, we'd want gmetad to have
> > > some way of judging whether the data it's getting from a gmond is
> > > fresh and current, which is not the same as judging whether it
> > > actually *got* the data from the gmond.
> 
> > the problematic code was introduced as a fix for BUG27 and was
> > indeed trying to detect if gmond was able to use the source or not
> > by looking at the obvious lack of TCP connectivity.  BUG92 showed
> > that the heuristic was incomplete because it didn't include gmond/system
> > that are hung but still responsive to a TCP three way handshake
> 
> That was the *original* discussion on this thread, and that is indeed
> what the "timeout poll" bug is trying to address.  What I'm saying is
> in reference to Timothy Witham's last note, which is a rather different
> matter.

I was trying to explain how the code evolved and how this problem was
introduced, as well as the proposed solutions that will hopefully come forward
in the next 2 years of the development cycle ;)

Before BUG27 (around 2.5.7 or the first releases of 3.0) gmetad failed over
through all sources sequentially until it got valid data.  The problem of
course (as explained in that bug) is that when the first source for your
datasource was down there was an added delay that resulted in gaps, as the
time needed to get data from that datasource grew in proportion to the
number of sources that were down at the beginning of the datasource's list
(which gets even worse when some security guy decides to install a firewall
that drops packets when misconfigured).
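
To put numbers on that delay, here is a minimal sketch in Python.  It is not
gmetad code; the function name, the timeout and the fetch time are all
hypothetical, and it only models "try each source in order, pay a full
connect timeout for every one that is down".

```python
def sequential_poll_delay(sources_up, connect_timeout, fetch_time):
    """Seconds spent before valid data arrives, or None if every source is down.

    sources_up is a list of booleans in the order the sources are listed
    for the datasource (hypothetical model, not gmetad's data structures).
    """
    elapsed = 0.0
    for up in sources_up:
        if up:
            # First reachable source: pay for the fetch and stop.
            return elapsed + fetch_time
        # Dead source (or a firewall silently dropping packets):
        # the whole connect timeout is wasted before moving on.
        elapsed += connect_timeout
    return None


# Two dead sources ahead of a live one, 10 s timeout, 1 s fetch.
delay = sequential_poll_delay([False, False, True], 10.0, 1.0)
```

With a polling interval shorter than that delay, every poll of the
datasource runs late, which is where the gaps come from.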

Timothy's solution to BUG92 in a way reintroduces that failover mechanism
by just choosing the source randomly, but it still doesn't address the
healthcheck issue, so it sometimes selects the bad source and therefore
results in more gaps.
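
A rough sketch of why random choice alone is not enough (Python, with
hypothetical names; this is not the patch attached to BUG92).  With no
health check, a uniform pick lands on a bad source in proportion to how
many of the sources are bad:

```python
import random


def pick_random_source(sources, rng=random):
    """Pick one source uniformly at random, with no health check at all."""
    return rng.choice(sources)


def expected_bad_polls(num_sources, num_bad, num_polls):
    """Expected number of polls that land on a bad source under uniform choice."""
    return num_polls * num_bad / num_sources


# One hung gmond out of four sources, over 100 polls.
bad_polls = expected_bad_polls(4, 1, 100)
```

So the bad source no longer gets picked every time, but it still gets
picked, and each of those polls is a potential gap.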

BUG208 will hopefully address all the current issues by implementing a real
load balancing solution for sources which also does health checking of some
sort (still to be designed) to ensure that "bad" sources are marked down and
not used, or in some cases to correct the result (as when accounting for
time drift between sources).
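
Since the health check in BUG208 is still to be designed, the following is
only one possible shape for it, sketched in Python under my own assumptions:
each source is judged by the timestamp of the newest data it returned, an
optional per-source drift correction is applied, and the first source whose
data is fresh enough wins.  The names is_fresh, choose_source, max_age and
drifts are all hypothetical.

```python
def is_fresh(report_time, now, max_age, drift=0.0):
    """True if the newest data is at most max_age seconds old.

    drift is a hypothetical correction, in seconds, added to the source's
    timestamp to compensate for its clock running behind ours.
    """
    return (now - (report_time + drift)) <= max_age


def choose_source(reports, now, max_age, drifts=None):
    """Return the name of the first source with fresh data, or None.

    reports is a list of (name, report_time) pairs in preference order;
    drifts optionally maps a source name to its clock correction.
    """
    drifts = drifts or {}
    for name, report_time in reports:
        if is_fresh(report_time, now, max_age, drifts.get(name, 0.0)):
            return name
    # Every source answered with stale data: treat them all as down.
    return None


# "a" answers but its data is 900 s old; "b" is 10 s old.
chosen = choose_source([("a", 100), ("b", 990)], now=1000, max_age=60)
```

The point is that "answered the TCP connection" and "returned fresh data"
are checked separately, which is the distinction being asked for in this
thread.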

> So I'm confused by your response - not that anything you say
> is confusing, I'm just confused by how it relates to what you
> responded to?

don't worry, I am used to people being confused by my responses, and
I sometimes get myself into long email threads repeating the same concept
again and again in different ways just because of that.

but as Cassandra said recently, the trick is in reading and re-reading the
message until it finally clicks or, as you did, asking for some clarification,
at least until I can figure out how to use punctuation correctly, because
learning English from BASIC left me with some strange aversion to semicolons
and to using dashes inside words.

Carlo

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
