On Fri, Aug 08, 2008 at 11:53:02AM +0100, [EMAIL PROTECTED] wrote:
> > Configuration management is always a challenge, but lucky for 
> > us ganglia doesn't have to do any of that because it is a 
> > cluster monitoring tool that can run without a configuration.

First, I want to be completely clear: I don't think the current solution
covers ALL possible cases, so there is certainly room for alternatives. In
fact, I wouldn't be surprised if it doesn't really work with the current code
either, as it was used before, simply because most of the people managing big
clusters have already implemented something better.

The point I am trying to make is that the current solution can be made to
work for simple cases and is non-intrusive. A more complex solution doesn't
need to be tied to gmond to work, and if it is done through gmond it will need
to be designed in a non-intrusive way and will likely require more development
resources than I am personally willing to give (there are other tools that can
do this outside ganglia, and we have more pressing problems to solve that are
more closely related to cluster monitoring than this one).

Luckily this is open source, so you are free to disagree and prove me wrong;
let your code do the talking.

> What happens when you have multiple clusters?

As I explained before, if every cluster has its own VLAN (which is a common
setup in HPC) then it just works: even if all clusters use the same multicast
address, the TTL is 1, so the traffic never leaves that VLAN and stays local
(you could think of it as an L2 broadcast message).
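For reference, the stock multicast channels in gmond.conf look roughly like
this (the address and port below are the usual defaults, but treat them as
illustrative and check your own gmond.conf):

  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
    ttl = 1
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }

With ttl = 1 the announcements can't cross a router, so each VLAN naturally
becomes its own cluster.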

Practically speaking, I've seen some switches get very confused in this
scenario (IOS bugs and bad configurations), which is why I personally
recommend avoiding multicast where possible (even though it is core to this
solution working). Still, the fact that it works at all, and the ingenuity of
the solution, speaks to the genius behind its design.

> Each node needs to know
> which cluster name and multicast address to use.

gmetad can override the cluster name (that is why gmetad.conf has a parameter
naming each data source), so you don't really need a different cluster name in
every gmond.conf (I haven't tested this with the latest code, since I have no
way to get working multicast in my test setup anyway).
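To illustrate, as I understand it the name given on the data_source line in
gmetad.conf is the one the frontend ends up showing (the hostnames below are
made up):

  data_source "compute-cluster" node01:8649 node02:8649

so the gmonds polled through those hosts are grouped under "compute-cluster"
regardless of what each gmond.conf says.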

> In a huge
> organisation, it is not feasible for every node to join the same
> multicast group.  In some clusters, not all nodes are on a shared
> multicast segment.  So the default configuration is useful, but it is
> not a solution for everyone.

Agreed; if you have clusters spanning multiple VLANs (which might be a design
mistake in itself), then this setup breaks, or forces you to enable multicast
routing between VLANs and to be careful about how you configure your switches
(sparse-mode support is a plus here, otherwise you'll end up with
over-utilized trunks and high CPU utilization on your switches).
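For completeness, a minimal sketch of what enabling sparse mode looks like on
an IOS box (the interface names are illustrative, and the exact commands vary
by platform and release):

  ip multicast-routing
  !
  interface Vlan10
   ip pim sparse-mode
  !
  interface Vlan20
   ip pim sparse-mode

plus a rendezvous point (ip pim rp-address ...) for the sparse-mode groups to
register with.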

> > you are missing half the problem here, as there also has to 
> > be a way to trigger gmond to reload its configuration (now a 
> > restart is required) and that means that there has to be also 
> > a secure way to instruct remote gmonds to restart themselves 
> > (ssh, puppet or cfengine are used for this).
> 
> Other management mechanisms may already be in place to restart any
> arbitrary service or restart a whole machine.  Such mechanisms don't
> always have the ability to deploy configuration files.

Fair, but a solution that is distributed with ganglia can't assume such a
mechanism exists and leave users to complete that part of the puzzle on
their own.

As soon as you allow them to get a configuration from a central server (like
LDAP), you also have to provide a mechanism for configuration changes to
propagate, and for nodes to be notified of those changes so they can reload
them securely.

> > a web service that generates a configuration on demand is 
> > usually easier to scale and maintain.
> 
> Fetching the configuration file from a web server is quite a valid
> solution - and as you point out, the file can be cached locally.  In
> contrast, the benefit of LDAP is that it can enforce some structure, it
> can integrate with an existing LDAP deployment, and it can be examined
> using a range of query tools.

If you really feel so strongly about LDAP (which suggests you have never seen
how ugly that can get), then knock yourself out and do it; just please keep
open the possibility of using some other mechanism (like HTTP).

I imagine something like:

  --config-url=ldap://server:389/dc=example,dc=com??sub?(cn=${HOSTNAME})

could be used for that, and could easily be extended to also support HTTP (as
long as the key used can be replaced dynamically based on some template).
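A hypothetical HTTP equivalent of the same idea, with the hostname substituted
into the URL by the same template mechanism, could look like:

  --config-url=http://config.example.com/gmond.conf?host=${HOSTNAME}

(both the flag and the server URL are made up here; the point is only that the
transport behind --config-url should be pluggable).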

Having a second configuration file that includes this information might be a
useful alternative to command-line options. It might even belong to a
different process: one that keeps running as root and is responsible for
spawning/restarting gmond when needed and for handling all the notifications
for this configuration management. Ideally this process would also recreate
gmond.conf as needed (so that it can still be used by gmetric) and cache the
configuration, to avoid an unneeded single point of failure and to help with
the scalability of the configuration server.

> > > Whether the configuration server uses LDAP or something else, how 
> > > should it be found?  Here are some ideas I had:
> > > - a configuration option for hard-wiring the configuration server 
> > > hostname
> > 
> > then you have to handle two different configuration formats 
> > in the application for the configuration file (or hopefully 
> > two configuration files) and you are back into square 1 with 
> > "configuration management" issues.
> 
> Not quite - the configuration server address will be static for a whole
> organisation, whereas parameters like the cluster name will vary from
> one host to the next

Nope: you can't have every node hitting only one configuration server and hope
to scale that efficiently, nor can you assume that all nodes will be able to
contact that one server (due to firewalls or network latency). You will
therefore end up with more than one server, and will have to manage that
configuration somehow as well (even if it is less likely to change).

> I would be quite happy having the configuration server hostname
> specified in my init script inside the RPM, Solaris package,  etc, as it
> only has to be inserted there once and never changed.  For redundancy,
> DNS round-robin would probably be sufficient.

Unless things have changed since, LDAP client libraries used to resolve names
only once, so they are not "DNS round-robin" friendly. Indeed, LDAP client
libraries prefer to be given multiple servers so they can use them for
failover (which was also very likely not to work as expected). On the bright
side, load balancing LDAP with a hardware load balancer is possible, but
health checking may not be that useful depending on the brand and version of
the load balancer you have available.

Carlo

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
