I like your comments. Here are some counter-questions :)
On Friday, August 30, 2002, at 02:34 PM, Steven Wagner wrote:
It seems to me this would also make the "DSO-ification" of the
monitoring core a smoother process, not to mention a cleaner one from
the standpoint of those developing the DSO's. :)
Good point.
I was thinking of "yet another hash" that has a hashed-up number based
on the name or hierarchy position of the metric as a key. The idea
being, this number is shorter than using the fully-qualified name of
the metric all the time.
So instead of encoding "cpu.idle" we encode 0x03FA450A, and that field
is 50% shorter (even better if we get to
"processes.top.1.cpu_percentage"), and only have to multicast the real
string name once. The hierarchical information is stored (as a
pointer, at the very least) in this hash.
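A short numeric ID like that could come from hashing the metric name.
As a sketch (the function name and the use of FNV-1a are my illustration,
not anything in Ganglia; the 0x03FA450A value above is likewise just an
example):

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1a hash of a metric name - one way to derive a short
 * numeric ID in place of the full string.  Purely a sketch; collisions
 * would still have to be handled, e.g. by multicasting the real string
 * name once (as described above) and resolving conflicts then. */
static uint32_t metric_id(const char *name)
{
    uint32_t h = 2166136261u;           /* FNV offset basis */
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        h ^= *p;
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}
```

The ID is deterministic, so every node computes the same 4-byte key from
the same name without coordination.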
What's really going to be key here is not so much the idea of making the
statically-#define'd metric hash dynamic, but keeping it up to date...
If we go far enough in this it'll look like SNMP, only more
collaborative. :)
So I am thinking that sending the fully-qualified metric name (as shown
above) is a better idea now - it handles failures more effectively. When
a node comes up it would receive metrics that look like
"host1/cpu/cache/size" (fully-qualified with all the metric's ancestors)
instead of "cache/size" (relative as I had suggested previously). This
fits in with Steven's idea of hosts being authoritative for branches
they created - each metric specifies its branches explicitly. It also
reduces reliance on an elder node for the branch hierarchy.
This way a node can easily create branches as needed for any metric it
receives.
About the "hash for storing fully qualified metric names (FQMN :)". How
would we populate such a hash? At some level, the metric must specify
its fully-qualified name, so we know where to put it. A hash value is no
good if we don't already have the name stored. How would you handle new
metrics? I think we could run-length encode the name strings to save
space if we need to, but having each metric carry its full name seems
clearer to me.
I imagine a hash_find(node, "cpu", "cache") function that takes a
variable number of arguments and locates the hash table in which to
insert a given metric (here: host1/cpu/cache/size). The 'node' argument
specifies the root of the metric tree - the node hash table for host1.
Note that each branch would get its own hash table, so that hash_foreach()
will work correctly and printing the XML will be easy.
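A variadic lookup along those lines might look like this - a sketch only,
with a toy fixed-size child array standing in for the real per-branch
hash tables (branch_t and this hash_find signature are my inventions):

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the per-branch hash tables: each branch keeps a
 * NULL-terminated list of named children. */
typedef struct branch {
    const char    *name;
    struct branch *children[8];   /* real code would use a hash table */
} branch_t;

/* Walk from 'node' down the named branches; the argument list is
 * NULL-terminated.  hash_find(node, "cpu", "cache", NULL) returns the
 * table where "size" belongs, or NULL if a branch is missing. */
static branch_t *hash_find(branch_t *node, ...)
{
    va_list ap;
    va_start(ap, node);
    for (const char *name;
         node != NULL && (name = va_arg(ap, const char *)) != NULL; ) {
        branch_t *next = NULL;
        for (int i = 0; node->children[i] != NULL; i++)
            if (strcmp(node->children[i]->name, name) == 0) {
                next = node->children[i];
                break;
            }
        node = next;
    }
    va_end(ap);
    return node;
}
```

A NULL sentinel terminates the argument list, which keeps the call sites
readable without a separate count parameter.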
To make this work, we simply add a 'hash_t *branch' member to the
metric_data_t structure. If branch==NULL then we are a leaf (actual
metric), else this is a branch that points to another hash table. I can
visualize the XML output code now...
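For the record, here is roughly how I picture it - a sketch where a
NULL-terminated array again stands in for the branch's hash table, and
the XML element names are placeholders rather than Ganglia's actual
output format:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The branch-or-leaf idea: if 'branch' is NULL the entry is a real
 * metric; otherwise it points at the child entries of a branch. */
typedef struct metric_data {
    const char          *name;
    const char          *value;    /* leaf value, unused for branches */
    struct metric_data **branch;   /* NULL => leaf, else child table */
} metric_data_t;

/* Recursively render XML into 'out': leaves become <METRIC/> elements,
 * branches open a <BRANCH> element and recurse over their own table -
 * exactly the per-branch hash_foreach() pattern described above.
 * Returns the number of characters produced. */
static int print_xml(const metric_data_t *m, char *out, size_t len)
{
    if (m->branch == NULL)
        return snprintf(out, len, "<METRIC NAME=\"%s\" VAL=\"%s\"/>",
                        m->name, m->value);
    int n = snprintf(out, len, "<BRANCH NAME=\"%s\">", m->name);
    for (int i = 0; m->branch[i] != NULL; i++)
        n += print_xml(m->branch[i], out + n,
                       len > (size_t)n ? len - n : 0);
    n += snprintf(out + n, len > (size_t)n ? len - n : 0, "</BRANCH>");
    return n;
}
```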
Dense, yes, but the area of metrics is just about the only one in the
Ganglia design that *doesn't* scale well (kudos, Matt & co.). I'm sure
that we can work this out if we just keep banging those rocks
together. :)
Clever ;)
Do people like the Java-like dot notation for hierarchical names, like
"host1.cpu.cache.size", or the Unix filesystem forward-slash notation:
"host1/cpu/cache/size"? I like the slashes because it's easy to tell if
you're talking about a leaf or a branch: "host1/cpu/" is clearly a
branch, while "host1.cpu." is a little harder to read. But either way
would work.
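With the slash notation, the branch/leaf test is a one-liner (a sketch;
is_branch is not an actual API):

```c
#include <assert.h>
#include <string.h>

/* A trailing '/' marks a branch: "host1/cpu/" is a branch, while
 * "host1/cpu/cache/size" is a leaf metric. */
static int is_branch(const char *name)
{
    size_t len = strlen(name);
    return len > 0 && name[len - 1] == '/';
}
```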
I'm psyched about this change, and I am ready to dive in right after the
2.5.0 release.
-Federico
Rocks Cluster Group, Camp X-Ray, SDSC, San Diego
GPG Fingerprint: 3C5E 47E7 BDF8 C14E ED92 92BB BA86 B2E6 0390 8845