Re: [Ganglia-developers] More info about the tcpconn latency issue...

Carlo Marcelo Arenas Belon Fri, 11 Jul 2008 12:17:14 -0700

On Fri, Jul 11, 2008 at 12:24:01PM -0600, Brad Nicholes wrote:
> >>> On 7/11/2008 at 11:15 AM, in message <[EMAIL PROTECTED]>, Carlo
> Marcelo Arenas Belon <[EMAIL PROTECTED]> wrote:
> 
> I guess I would just rather see it distributed so that the user can decide
> what they want to do rather than us making the decision for them.


I agree with you on that; the only difference of opinion is about how to
distribute it and whether that is feasible now (see below).

> > My suggestion was to make a file name change as well into the contrib
> > directory, where it won't get in the way and will be also available for
> > those that want to use it, but since there is no contrib yet distributed
> > then cleanly removing it (it will be available from our repository in
> > the web anyway for whoever wants to install it) looks like the best next
> > option.
> 
> I would agree as well if we had a contrib/ directory.  But just because
> we don't should not mean that we remove it completely and make it
> unavailable for those that would still like to use it.

There is also the possibility of adding the contrib directory to this first
release and using that instead. That should be safe enough, and it has
already been voted for backport (but for the next release).

Feel free to commit that, then, and base the disabling of this metric and its
documentation on the contrib directory, which should satisfy all the concerns
raised.

If you go that route, it might also be a good idea to backport the inclusion
of ganglia-rrd-modify.pl in contrib, which has also been approved and
depended on that first backport.

But if you go that route (and this is where it starts becoming a risky
proposition), it would also be nice to backport the original Python 2.4
compatible version, which doesn't have the problem the 2.3 compatible version
has and would be a better fit for the majority of users. The exceptions are
those stuck with Python 2.3, such as CentOS 4 users, who have other problems
getting ganglia running as well, like the lack of an official APR1 package
they could use as a dependency. That version doesn't exist yet, though (even
if it would be easy to come up with, as you explained, by rolling back the 2.3
compatibility patches), and it probably hasn't been tested as much as the
buggy one.

> >> It still works reliably, it just has a wait timeout issue that is really
> >> only noticeable when using the -m parameter.
> > 
> > but that would result in some metric samples failing silently and therefore
> > in some holes in the RRD values that could then result in mysterious drops
> > in the graphs or flat lines.
> 
> No and the reason why is because the actual gathering of the data is
> threaded.  tcpconn.py spins up its own gathering thread that periodically
> exec's netstat and updates an internal array of metrics.
> When the gmond main thread requests the metrics, all it does is read the
> internal array and return whatever the last gathered value was.

OK, but then the thread that spawns netstat will randomly fail, and so,
depending on how often it fails compared with how often gmond polls, you
will get flat lines.
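
To make the point concrete, here is a minimal sketch of the pattern being
discussed: a background thread refreshes a cache, and the reader only ever
returns the last cached value. It is not the tcpconn.py code; the class name,
the `sample` callable (standing in for running netstat) and the metric name
are assumptions made for illustration. When a sample fails, the cache keeps
the previous value, which is what would show up as a flat line.

```python
import threading


class CachedGatherer:
    """Hypothetical sketch: a thread refreshes a cache, readers never block on sampling."""

    def __init__(self, sample, interval=1.0):
        self._sample = sample        # callable returning a dict; stands in for netstat
        self._interval = interval
        self._cache = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                fresh = self._sample()
            except Exception:
                fresh = None         # failed sample: the old values stay in the cache
            if fresh is not None:
                with self._lock:
                    self._cache = dict(fresh)
            self._stop.wait(self._interval)

    def read(self, name):
        # What the polling side would call: return the last gathered value.
        with self._lock:
            return self._cache.get(name, 0)

    def stop(self):
        self._stop.set()
        self._thread.join()
```

With a sampler that succeeds once and then keeps failing, `read()` keeps
returning the first value for as long as the failures last.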

> There is no delay to gmond at all.  At worst, the tcpconn gathering thread
> might delay occasionally which has no effect on anything else.  It was
> written this way on purpose so that gmond would never be at the mercy of
> the python exec code, netstat delays in execution or OS delays.

Good to know, and definitely a sound architectural design.

> The delay only shows up for gmond when the tcpconn metric_cleanup() function
> is called and the main gmond process has to wait for the tcpconn gathering
> thread to shutdown.  That's why you see the delay with the -m parameter
> and nowhere else.

Well, as you explained, you also see it at shutdown.

> The gmond -m option causes the metric_init(), which starts the gathering
> thread and the metric_cleanup() which shuts down the gathering thread,
> to happen one immediately after the other.  Gmond has to delay waiting
> for the thread cleanup.

And this is IMHO a bug, but a fix for it is not something that will be ready
to release anytime soon, as spelled out in the STATUS file.
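
A generic illustration of why starting and stopping a gathering thread back
to back can stall the caller, assuming a worker that sleeps between samples
and a cleanup that joins it. This is a sketch only, not the actual tcpconn.py
logic; `SleepingWorker` and `init_then_cleanup` are hypothetical names.

```python
import threading
import time


class SleepingWorker(threading.Thread):
    """Hypothetical worker that sleeps a fixed interval between samples."""

    def __init__(self, interval):
        super().__init__(daemon=True)
        self.interval = interval
        self.running = True

    def run(self):
        while True:
            # A real gatherer would take its sample here.
            time.sleep(self.interval)    # not interrupted by the shutdown flag
            if not self.running:
                break

    def shutdown(self):
        self.running = False
        self.join()                      # blocks until the current sleep ends


def init_then_cleanup(interval):
    """Start the worker and stop it right away; return seconds spent in cleanup."""
    worker = SleepingWorker(interval)
    worker.start()
    began = time.monotonic()
    worker.shutdown()
    return time.monotonic() - began
```

Calling `init_then_cleanup(5.0)` spends roughly the whole five seconds in
`shutdown()`, because the join cannot return until the sleep finishes.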

It would be better if metric_init() didn't start the thread that spawns
netstat but left that to the collection method that gmond schedules, which
would just need to take the first sample and start the thread the first
time it is called.

This way metric_cleanup() wouldn't need to wait in the `gmond -m` case
either, which in principle shouldn't execute any metric collection code.
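
A rough sketch of that idea, under the assumption that the collection
callback is only ever called from one gmond thread. The names here
(`LazyGatherer`, `callback`, `cleanup`, `sample`) are made up for
illustration and are not the module's real interface.

```python
import threading


class LazyGatherer:
    """Hypothetical sketch: the gathering thread starts on the first collection call."""

    def __init__(self, sample, interval=1.0):
        self._sample = sample        # callable returning a dict; stands in for netstat
        self._interval = interval
        self._cache = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None          # nothing is started at init time

    def _refresh(self):
        try:
            values = self._sample()
        except Exception:
            return                   # keep the previous values on failure
        with self._lock:
            self._cache = dict(values)

    def _run(self):
        # Event.wait() returns True as soon as cleanup() sets the event,
        # so the thread does not have to finish a full sleep first.
        while not self._stop.wait(self._interval):
            self._refresh()

    def callback(self, name):
        if self._thread is None:
            self._refresh()          # first sample, taken in the caller
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
        with self._lock:
            return self._cache.get(name, 0)

    def cleanup(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

If `cleanup()` runs before any `callback()`, as in the `gmond -m` case, there
is no thread to join and it returns immediately.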

Carlo
