matt massie wrote:
steve-

you can see that the only metadata i've put in right now is plugin name, author and version (see test-plugin.c ganglia_main()). i welcome any ideas about other metadata we might need.
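A minimal sketch of what such plugin metadata might look like in C. The struct layout, field names, and the `get_metadata` symbol are all assumptions for illustration, not the actual test-plugin.c API:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical metadata record a plugin could export at a well-known
 * symbol so the daemon can read it after dlopen().  These names are
 * guesses, not the real g3 interface. */
typedef struct {
    const char *name;     /* plugin name, e.g. "cpu"  */
    const char *author;   /* maintainer               */
    const char *version;  /* plugin version string    */
} plugin_metadata_t;

static const plugin_metadata_t metadata = {
    .name    = "test-plugin",
    .author  = "matt massie",
    .version = "0.1",
};

/* The daemon would dlsym() this after loading the shared object. */
const plugin_metadata_t *get_metadata(void)
{
    return &metadata;
}
```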


right now, i envision that we will have a service plugin directory (say /var/lib/ganglia/services). every file in that directory will be loaded at gmond (gserviced?) startup. each of the service plugins will then load up all the data collect/publish plugins that it needs. keep in mind that a collect/publish module is not restricted in the number of metrics it can process.
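The startup scan described above could look roughly like this: open the service-plugin directory and dlopen() every ".so" found there. The path, return convention, and error handling here are assumptions, not the actual gmond code:

```c
#include <assert.h>
#include <dirent.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Scan a directory and dlopen() each ".so" file in it.
 * Returns the number of plugins loaded, or -1 if the directory
 * cannot be opened. */
int load_service_plugins(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    if (!dir)
        return -1;                 /* directory missing or unreadable */

    int loaded = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        const char *dot = strrchr(entry->d_name, '.');
        if (!dot || strcmp(dot, ".so") != 0)
            continue;              /* skip non-plugin files */

        char path[1100];
        snprintf(path, sizeof(path), "%s/%s", dir_path, entry->d_name);
        void *handle = dlopen(path, RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "skipping %s: %s\n", path, dlerror());
            continue;              /* e.g. built for another platform */
        }
        loaded++;
        /* each service plugin would now load its collect/publish modules */
    }
    closedir(dir);
    return loaded;
}
```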

So a service plug-in, in this particular case, would be something along the lines of "CPU monitoring," and would load a percentage collector, a number-of-CPUs collector, possibly a temperature collector, etc.?

I assume these plugins only have to be initialized once. Are we thinking about what happens when a monitored component is added/removed while the daemon is running? I'm not talking CPUs here, but think about a disk monitor.

You have your (probably Linux-specific) disk monitor service. This checks out your attached devices and loads things like a SMART status monitor plug-in, a filesystem-per-disk metric, and so forth. Ganglia runs for a while. Then the RAID array's taken offline to be rebuilt, or another one is added.

Do the service or collector plug-ins support some form of messaging/event model that would allow this to happen during the course of normal operation or would this involve some sort of SIGHUP-style daemon-kicking?

It's entirely possible that an individual collector could notice something that requires a rescan by the other collectors in that service (the SCSI monitor notices a new disk just got added to the array and sends a "rescan" event to its parent disk monitoring service, to use the example above).
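The rescan idea above could be sketched as a small event-fanout API: a collector posts an event to its parent service, which delivers it to the sibling collectors. Every name here (event types, struct fields, `service_post_event`) is hypothetical:

```c
#include <assert.h>

typedef enum { EVENT_RESCAN, EVENT_DEVICE_ADDED, EVENT_DEVICE_REMOVED } event_type_t;

typedef struct collector {
    const char *name;
    int (*handle_event)(struct collector *self, event_type_t ev);
    int rescans;                       /* demo bookkeeping only */
} collector_t;

typedef struct {
    collector_t *collectors[8];
    int          count;
} service_t;

/* Deliver an event from one collector to every *other* collector
 * registered with the same service. */
static void service_post_event(service_t *svc, collector_t *from, event_type_t ev)
{
    for (int i = 0; i < svc->count; i++)
        if (svc->collectors[i] != from)
            svc->collectors[i]->handle_event(svc->collectors[i], ev);
}

static int count_rescan(collector_t *self, event_type_t ev)
{
    if (ev == EVENT_RESCAN)
        self->rescans++;
    return 0;
}

/* Tiny demo of the disk example: the SCSI collector notices a new disk
 * and posts a rescan; only the sibling SMART collector should see it. */
int demo_rescan(void)
{
    collector_t scsi  = { "scsi",  count_rescan, 0 };
    collector_t smart = { "smart", count_rescan, 0 };
    service_t disk = { { &scsi, &smart }, 2 };
    service_post_event(&disk, &scsi, EVENT_RESCAN);
    return smart.rescans * 10 + scsi.rescans;   /* expect 10: smart got it, scsi did not */
}
```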

This same framework could allow an enterprising individual to write a notifier front-end that sends SNMP traps, e-mails, smoke signals, or updates a display on the front of the box when certain events occur.

you can see how i changed the job scheduler. each job has a job-specific collect and publish function now (see g3_job_t in g3.h). i needed to have both functions in each job (instead of linking them) in order for us to have multiple service frontends.
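A hedged sketch of the per-job layout described here. The real `g3_job_t` lives in g3.h; the fields and `run_job` below are guesses for illustration only, showing why carrying both function pointers in the job lets different service frontends mix collectors and publishers freely:

```c
#include <assert.h>
#include <stddef.h>

/* Guessed shape of a job: each job carries its own collect and publish
 * function pointers instead of being hard-linked to one pair. */
typedef struct g3_job {
    const char *name;
    int (*collect)(struct g3_job *job);   /* gather the metric */
    int (*publish)(struct g3_job *job);   /* emit it somewhere */
    void *data;                           /* job-private state */
} g3_job_t;

/* The scheduler just runs collect, then publish, per job. */
static int run_job(g3_job_t *job)
{
    int rc = job->collect(job);
    if (rc == 0)
        rc = job->publish(job);
    return rc;
}

/* Demo stubs: collect stashes a result, publish checks it exists. */
static int demo_collect(g3_job_t *j) { j->data = j; return 0; }
static int demo_publish(g3_job_t *j) { return j->data ? 0 : -1; }

int demo_job(void)
{
    g3_job_t job = { "demo", demo_collect, demo_publish, NULL };
    return run_job(&job);
}
```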

That's what I saw yesterday, and it makes sense to me. But is each job associated with a single metric? Will a plug-in be able to share data between its instances?

What I'm getting at is, if you have a job for monitoring each mounted local filesystem, and they all use xfs_monitor.so, and there *isn't* a shared memory location for them all to stash the most recent results, then you're polling $NUMBER_OF_PARTITIONS times more often than you need to be. Which is programmatically gross and in some time-sensitive environs could be construed as bad.
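One way the shared-result concern above could be addressed is a cached sample that all instances of the same plugin read from, refreshed at most once per interval. The struct, the 15-second window, and the constant result are illustrative assumptions:

```c
#include <assert.h>
#include <time.h>

/* One shared sample for all jobs backed by the same plugin. */
typedef struct {
    double value;       /* last collected result         */
    time_t stamp;       /* when it was collected         */
    int    polls;       /* how often we *really* polled  */
} shared_sample_t;

static shared_sample_t cache = { 0.0, 0, 0 };

/* Stand-in for the expensive work of scanning every partition once. */
static double expensive_poll(void)
{
    cache.polls++;
    return 42.0;
}

/* Each per-partition job calls this; only the first caller in a
 * 15-second window pays for the real poll. */
double get_sample(time_t now)
{
    if (now - cache.stamp >= 15) {
        cache.value = expensive_poll();
        cache.stamp = now;
    }
    return cache.value;
}
```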

And if each job resolves to a plug-in, and it's up to the plug-in to make the metrics ... hmmm, I guess that answers all the questions that I've actually raised up to this point. DOH! Except about the event model.

this also allows us to have push AND pull methods for publishing metrics.
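The push/pull split might look something like this: the same stored metric is either formatted and sent out on a schedule (push) or handed back when a frontend asks for it by name (pull). All names here are illustrative, not the g3 API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char   name[32];
    double value;
} metric_t;

static metric_t store;

/* push path: the daemon periodically formats the metric onto the wire */
int push_metric(const metric_t *m, char *wire, size_t len)
{
    return snprintf(wire, len, "%s=%g", m->name, m->value) > 0 ? 0 : -1;
}

/* pull path: a frontend requests the metric on demand */
const metric_t *pull_metric(const char *name)
{
    return strcmp(store.name, name) == 0 ? &store : NULL;
}
```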

This will make Lester very happy. :)

you'll see inetaddr.c, tcp.c, udp.c and mcast.c in the distribution now. g3 will have a full multicast, udp and tcp library to use in building these services. i've compiled and tested the networking library on Linux, Solaris, FreeBSD, Cygwin and MacOS X.

When there's a front-end ready, *that's* when I'll start getting excited.

Is there any reason to make a g3 metadaemon? Wouldn't it be possible to implement this as one or more front-end/service plug-ins?

.. i got off track there ..

back to the plugin question... if a plugin is compiled on a different platform from the one trying to load it, then dlopen() will fail and we won't even be able to get at the metadata.
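Concretely: dlopen() fails before dlsym() ever gets a chance, so the metadata symbol is simply unreachable. The symbol name "get_metadata" here is an assumption:

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Try to load a plugin and look up its metadata symbol.
 * If the .so was built for another platform (or doesn't exist),
 * dlopen() returns NULL and we never reach dlsym(). */
const void *probe_plugin(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "cannot load %s: %s\n", path, dlerror());
        return NULL;          /* metadata is unreachable */
    }
    return dlsym(handle, "get_metadata");
}
```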

woohoo!

i think the question you are hitting on is this.. what is the best approach to building the plugins: platform-specific or metric-specific? platform-specific means that a developer builds a plugin which only works on a single target platform but has many metrics (this is more like our approach in the past).. OR.. should we have a metric-specific plugin (say load) which only measures a single metric but works across a range of platforms? i think the first approach is best...

I think a combination is best, actually. There are some POSIX-y things out there that we can monitor on anything. Not a lot, but it's something. Enough to encourage people to write their own stuff.

I'm talking about something like a uname plugin that works on a pretty wide range of systems. The MTU value, as well. There are a few instances in the machine/*.c code where we've reinvented the wheel in several shapes and sizes. It would be nice to eliminate that.
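A uname collector really can be a single portable plugin, since uname(2) is POSIX. A minimal sketch (the function name and output format are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Portable uname metric: works unchanged on Linux, Solaris, FreeBSD,
 * MacOS X, etc., because uname(2) is POSIX. */
int collect_uname(char *buf, size_t len)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    snprintf(buf, len, "%s %s", u.sysname, u.release);
    return 0;
}
```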

But I do think that we should settle on a baseline set of metrics for all supported platforms, whether we decide it privately for our own purposes or state it publicly. It doesn't seem to be widely known that Ganglia's metric output varies by platform. Maybe for g3 we should make a chart that shows the metrics supported per platform...

.. you know.. i just realized that i'm rambling on and on.. if you find anything useful in this message.. please feel free to reply..

Rambling is what developer lists are for!

