On Wed, Sep 03, 2008 at 09:39:57AM +0100, [EMAIL PROTECTED] wrote:
> 
> Here's a little something we discovered by accident: if you move some of
> the module definitions into files in /etc/ganglia/conf.d/something.conf,
> but forget to remove them from /etc/ganglia/gmond.conf, then Ganglia
> tries to initialise the same module twice.

a similar situation presumably happens when you compile your modules statically
(--enable-static-build) and by mistake use a configuration that instructs
gmond to load the module.

in both cases we sadly don't do much checking: we assume the configuration is
OK and try to load the code, generating the havoc you reported.

the solution to this bug will be (like apache does) to first check the symbol
table for the object we are planning to insert and, if it is already there,
just print a warning and skip loading it.
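
as a rough sketch of that "warn and skip" idea (this is not gmond's actual
loader code, and it does not inspect the symbol table: it assumes a simple
in-memory list of module names, and register_module(),
module_already_loaded() and the size limits are all made up for
illustration), the check could look something like this:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_MODULES 64      /* arbitrary limit, assumed for this sketch */
#define MAX_NAME_LEN 128    /* arbitrary limit, assumed for this sketch */

/* Hypothetical registry of module names seen so far (not gmond's real data). */
static char loaded_modules[MAX_MODULES][MAX_NAME_LEN];
static size_t loaded_count = 0;

/* Returns 1 if a module with this name was registered earlier, 0 otherwise. */
int module_already_loaded(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < loaded_count; i++) {
        if (strcmp(loaded_modules[i], name) == 0)
            return 1;
    }
    return 0;
}

/*
 * Returns 0 if the name was registered (the caller may go on to load it),
 * 1 if it is a duplicate and should be skipped,
 * -1 if the registry is full or the name is too long.
 */
int register_module(const char *name)
{
    if (module_already_loaded(name)) {
        fprintf(stderr, "warning: module '%s' already loaded, skipping\n", name);
        return 1;
    }
    if (loaded_count >= MAX_MODULES || strlen(name) >= MAX_NAME_LEN)
        return -1;
    strcpy(loaded_modules[loaded_count++], name);
    return 0;
}
```

the loader would call register_module() for each module entry in the
configuration and only go on to load the object when it returns 0, so a
second entry for the same module would produce a warning instead of a second
initialisation.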

> The result is that on the second initialisation, the module uses up all
> the machine's memory and then the process crashes.

this might be a reflection of another bug, probably a memory leak we have
which is just being amplified by the previous problem, as I would expect the
memory allocated for the module to be returned when the module fails to load
because the dynamic linker finds a conflicting object and aborts.

> I observed this first with one of my own modules, and then reproduced it
> with the cpu module.

BTW, the problem is not in using different configuration files, but the fact
that the configuration has 2 entries for the same module, which could just as
well be next to each other in the same file.

> On Solaris, the process dies - on Linux, the whole box has gone down.

Linux is misconfigured there, as no errant process should be able to take a
system down. sadly, this is just a common case of linux misconfiguration
(linux haters will say it is a design issue), where distributions try to be
conservative and don't adequately protect you from "fork bombs" or, in this
case, "fast memory leaks". the dreaded OOM killer could help here, as could
setting some sane limits (usually in /etc/security/limits.conf) and VM tuning.
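
to give an idea of what such a limit does, here is a minimal sketch that caps
a process's address space from inside the process using the POSIX
getrlimit()/setrlimit() calls with RLIMIT_AS. this is an alternative to
limits.conf, not something gmond does today as far as this thread goes, and
cap_address_space() is a made-up helper name:

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Caps the soft address-space limit of the calling process at max_bytes
 * (clamped to the current hard limit). Hypothetical helper for illustration.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int cap_address_space(rlim_t max_bytes)
{
    struct rlimit lim;

    assert(max_bytes > 0);
    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;

    /* An unprivileged process cannot go above the hard limit. */
    if (lim.rlim_max != RLIM_INFINITY && max_bytes > lim.rlim_max)
        max_bytes = lim.rlim_max;

    lim.rlim_cur = max_bytes;   /* only the soft limit is changed */
    if (setrlimit(RLIMIT_AS, &lim) != 0)
        return -1;
    return 0;
}
```

with a cap like this in place, a runaway allocation loop should start getting
failures from malloc() once the limit is reached instead of eating all the
memory on the box, so only the leaking process suffers.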

> Should it be the module developer who detects this condition, or the
> module loader code?

module loader code, but for now it is the user's responsibility to have a sane
configuration.

if you meant that the module developer's code should check for that, I think
that, other than cleanly freeing all allocated memory at shutdown, there is
not much that can be done at that point.

also I am presuming you reproduced this problem with 3.1.0 as I did, or was
this a report of 3.1.1 testing going bad? in any case, even if it is not
3.1.1 specific, getting a fix for this sooner rather than later might be a
good idea, but I will defer to Brad on that decision, as code has yet to be
produced to fix this and 3.1.1 is an important milestone, at least as a
starting point for people being able to deploy 3.1 in production with some
confidence.

Carlo

_______________________________________________
Ganglia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
