matt massie wrote:
steve-
you can see that the only metadata i've put in right now is plugin name,
author and version (see test-plugin.c ganglia_main()). i welcome any
ideas for more metadata that we need.
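to make that concrete, here's roughly the shape of the metadata block
i'm picturing.. the struct and field names below are just illustrative,
not the real g3 types (test-plugin.c has the real thing):

    /* illustrative only -- the actual fields live in test-plugin.c's
       ganglia_main(); this just shows the three pieces of metadata
       mentioned above */
    typedef struct {
       const char *name;     /* plugin name     */
       const char *author;   /* plugin author   */
       const char *version;  /* plugin version  */
    } plugin_metadata_t;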
right now, i envision that we will have a service plugin directory (say
/var/lib/ganglia/services). every file in that directory will be loaded
at gmond (gserviced?) startup. each of the service plugins will then load
up all the data collect/publish plugins that it needs. keep in mind that
a collect/publish module is not restricted in the number of metrics it can
process.
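the startup scan itself would be dead simple.. something like this
sketch (only the directory path and the dlopen()-per-file idea come from
the design above; the rest is guesswork):

    #include <dirent.h>
    #include <dlfcn.h>
    #include <stdio.h>
    #include <string.h>

    /* sketch: dlopen() every shared object in the service directory */
    static void load_service_plugins(const char *dir)
    {
       DIR *d = opendir(dir);
       struct dirent *e;
       char path[1024];

       if (!d)
          return;
       while ((e = readdir(d)) != NULL) {
          if (!strstr(e->d_name, ".so"))
             continue;
          snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
          if (!dlopen(path, RTLD_NOW))
             fprintf(stderr, "skipping %s: %s\n", path, dlerror());
       }
       closedir(d);
    }

gmond would call that once with /var/lib/ganglia/services at startup.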
So a service plug-in, in this particular case, would be something along the
lines of "CPU monitoring," and would load a percentage collector, a
number-of-CPUs collector, possibly a temperature collector, etc.?
I assume these plugins only have to be initialized once. Are we thinking
about what happens when a monitored component is added/removed while the
daemon is running? I'm not talking CPUs here, but think about a disk monitor.
You have your (probably Linux-specific) disk monitor service. This checks
out your attached devices and loads things like a SMART status monitor
plug-in, a filesystem-per-disk metric, and so forth. Ganglia runs for a
while. Then the RAID array's taken offline to be rebuilt, or another one
is added.
Do the service or collector plug-ins support some form of messaging/event
model that would allow this to happen during the course of normal operation
or would this involve some sort of SIGHUP-style daemon-kicking?
It's entirely possible that an individual collector could notice something
that requires a rescan by the other collectors in that service (the SCSI
monitor notices a new disk just got added to the array and sends a "rescan"
event to its parent disk monitoring service, to use the example above).
This same framework could allow an enterprising individual to write a
notifier front-end that sends SNMP traps, e-mails, smoke signals, or
updates a display on the front of the box when certain events occur.
you can see how i changed the job scheduler. each job has a job-specific
collect and publish function now (see g3_job_t in g3.h). i needed to have
both functions in each job (instead of linking them) in order for us to
have multiple service frontends.
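in sketch form (g3.h has the real definition; the field names here are
from memory and may not match exactly):

    /* each job carries its own collect and publish hooks now,
       instead of having them linked in globally */
    typedef struct g3_job {
       int (*collect)(struct g3_job *job);   /* gather metric data   */
       int (*publish)(struct g3_job *job);   /* hand it to frontends */
       void *data;                           /* job-private state    */
    } g3_job_t;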
That's what I saw yesterday, and it makes sense to me. But is each job
associated with a single metric? Will a plug-in be able to share data
between its instances?
What I'm getting at is, if you have a job for monitoring each mounted local
filesystem, and they all use xfs_monitor.so, and there *isn't* a shared
memory location for them all to stash the most recent results, then you're
polling $NUMBER_OF_PARTITIONS times more often than you need to be. Which
is programmatically gross and in some time-sensitive environs could be
construed as bad.
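A timestamped cache owned by the plug-in itself would cover it.
Sketching (this is pure speculation on my part, nothing in g3 today;
expensive_poll() is a hypothetical stand-in for the real device poll):

    #include <time.h>

    #define CACHE_TTL 15  /* seconds; arbitrary */

    /* hypothetical stand-in for the real device poll */
    static double expensive_poll(void) { return 0.0; }

    static time_t last_poll;      /* shared by every job instance */
    static double cached_value;

    /* any job instance firing within CACHE_TTL of the last real
       poll reuses the cached result instead of hitting the device
       again */
    double get_metric(void)
    {
       time_t now = time(NULL);

       if (now - last_poll >= CACHE_TTL) {
          cached_value = expensive_poll();
          last_poll = now;
       }
       return cached_value;
    }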
And if each job resolves to a plug-in, and it's up to the plug-in to make
the metrics ... hmmm, I guess that answers all the questions that I've
actually raised up to this point. DOH! Except about the event model.
this also allows us to have push AND pull methods for publishing metrics.
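in sketch form, reusing the g3_job_t shape from above (illustrative
only):

    /* the same per-job publish hook can be driven both ways */
    void on_schedule_tick(g3_job_t *job)     /* push: daemon-initiated */
    {
       job->collect(job);
       job->publish(job);
    }

    void on_frontend_request(g3_job_t *job)  /* pull: client-initiated */
    {
       job->publish(job);   /* ship whatever was last collected */
    }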
This will make Lester very happy. :)
you'll see inetaddr.c tcp.c udp.c and mcast.c in the distribution now.
g3 will have a full multicast, udp and tcp library to use in building
these services. i've compiled and tested the networking library on Linux,
Solaris, FreeBSD, Cygwin and MacOS X.
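for flavor, here's roughly the raw BSD socket dance that mcast.c has to
wrap portably (using gmond's usual 239.2.11.71:8649 channel; error
checks trimmed, and none of this is the actual g3 API):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* join a multicast group on a udp socket */
    int join_mcast_group(void)
    {
       int fd = socket(AF_INET, SOCK_DGRAM, 0);
       struct sockaddr_in addr;
       struct ip_mreq mreq;

       memset(&addr, 0, sizeof(addr));
       addr.sin_family      = AF_INET;
       addr.sin_addr.s_addr = htonl(INADDR_ANY);
       addr.sin_port        = htons(8649);
       bind(fd, (struct sockaddr *)&addr, sizeof(addr));

       mreq.imr_multiaddr.s_addr = inet_addr("239.2.11.71");
       mreq.imr_interface.s_addr = htonl(INADDR_ANY);
       setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                  &mreq, sizeof(mreq));
       return fd;
    }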
When there's a front-end ready, *that's* when I'll start getting excited.
Is there any reason to make a g3 metadaemon? Wouldn't it be possible to
implement this as one or more front-end/service plug-ins?
.. i got off track there ..
back to the plugin question... if a plugin is compiled on a different
platform than the one trying to load it, then dlopen() will fail and we
won't even be able to get at the metadata.
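i.e. the failure mode is simple.. dlopen() hands back NULL and all we
can get is dlerror()'s string:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
       /* a plugin built for the wrong platform never gets further
          than this: dlopen() returns NULL, so the metadata symbols
          inside the .so are unreachable */
       void *h = dlopen("./test-plugin.so", RTLD_NOW);

       if (!h) {
          fprintf(stderr, "can't load plugin: %s\n", dlerror());
          return 1;
       }
       dlclose(h);
       return 0;
    }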
woohoo!
i think the question you are hitting on is this.. what is the best
approach to building the plugins: platform-specific or metric-specific?
platform-specific means that a developer builds a plugin which only works
on a single target platform but has many metrics (this is more like our
approach in the past).. OR.. should we have a metric-specific plugin (say
load) which only measures a single metric but works across a range of
platforms. i think the first approach is best...
I think a combination is best, actually. There are some POSIX-y things out
there that we can monitor on anything. Not a lot, but it's something.
Enough to encourage people to write their own stuff.
I'm talking about something like a uname plugin that works on a pretty wide
range of systems. The MTU value, as well. There are a few instances in
the machine/*.c code where we've reinvented the wheel in several shapes and
sizes. It would be nice to eliminate that.
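For instance, the portable core of a uname metric is just this (plug-in
glue omitted; this is plain POSIX, nothing g3-specific):

    #include <stdio.h>
    #include <sys/utsname.h>

    int main(void)
    {
       struct utsname u;

       /* uname() is POSIX and works on every platform we support */
       if (uname(&u) == 0)
          printf("%s %s %s\n", u.sysname, u.release, u.machine);
       return 0;
    }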
But I do think that we should settle on the baseline metrics we're
working towards for all supported platforms, whether we decide that
privately for our own purposes or state it publicly. It doesn't seem
to be very widely known that Ganglia's metric output varies by platform.
Maybe for
g3 we should make a pretty chart that shows the metrics supported per
platform...
.. you know.. i just realized that i'm rambling on and on.. if you find
anything useful in this message.. please feel free to reply..
Rambling is what developer lists are for!