Yesterday, Steven Wagner wrote forth saying...
> I can see I'm going to have to drop the microphone mathematics.
>
> matt massie wrote:
> > so i'm pretty certain g3 will be a pure xml beast. no more xdr messages
> > on the wire. here's my thinking on this...in no necessary order..
>
> I'm going to shock you by saying I don't like this. I know, you're asking
> yourself why someone who's watching a dual-processor E420R take over 10
> seconds to parse a 3.6MB gmetad output is against the idea of using more
> XML elsewhere in the program design.
3.6MB of ganglia data is lots o' data.. without custom metrics, that's info
on about 1000 nodes. your problem doesn't really lie with the speed of
the php xml parser alone but also with the fact that strcmp()s are being
performed on all tags and attributes and the entire tree is being written
to PHP data structures. there is no doubt gmetad now is very limited.
> It's very portable, I'm not arguing that point. On the monitoring cores I
> am worried about speed and CPU cycles - I want the monitoring core to be
> very high in one respect, very low in the other.
>
> [insert joke here.]
it's a vacation holiday weekend right now but next week i'll get some
benchmark numbers for you to quantify the CPU and bandwidth cost of each
approach. we should also consider extensibility and encapsulation.
with xml we can easily extend attributes or tags and older versions of
gmetad will keep right on processing the portion of the xml that they
understand and ignore any new tags/attrs.
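just to sketch that forward-compatibility point in python (ElementTree
standing in for whatever parser; the SCORE attribute and EXTRA tag are
made-up "future" additions, not real ganglia elements):

```python
import xml.etree.ElementTree as ET

# an "old" consumer that only knows NAME and METRIC; a newer gmond has
# added a hypothetical SCORE attribute and an EXTRA child element.
xml_data = ('<HOST NAME="node01" SCORE="7">'
            '<METRIC NAME="cpu_num" VAL="2"/><EXTRA/></HOST>')

host = ET.fromstring(xml_data)
print(host.get("NAME"))              # reads what it knows: node01
for child in host:
    if child.tag == "METRIC":        # unknown tags are simply skipped
        print(child.get("NAME"), child.get("VAL"))
```

the old code never breaks; it just silently ignores the parts of the
tree it wasn't written for.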
as far as encapsulation...
with gmetad3, i think the right approach is to have it pull the entire
data source tree over once and then maintain a persistent connection which
plugs it directly into the data source multicast group. the data source
would be able to easily forward messages on...
gmond on Host A sends a multicast (or whatever) mu (/cpu/number) .. and
Host B (acting as a data source) simply expands the path to
Cluster A::Host A:cpu:number and sends the message up the line.
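in python that expansion step would look something like this (the naming
scheme here is just illustrative, not the actual g3 wire format):

```python
def expand_path(cluster, host, path):
    """Expand a host-relative metric path (e.g. "/cpu/number") into a
    fully-qualified name before forwarding it up the line."""
    parts = path.strip("/").split("/")
    return f"{cluster}::{host}:" + ":".join(parts)

print(expand_path("Cluster A", "Host A", "/cpu/number"))
# Cluster A::Host A:cpu:number
```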
these persistent connections are important because.. think about gmetad
now.. it is polling data every n secs (by default n=15). some of the data
it is polling only gets updated every hour or every day. why pull it over
the wire and write it to RRDs if we don't need to?
each message would carry its time step and so the RRD would be created and
updated appropriately for each part of the metric tree.
we can extend and encapsulate xdr data as well but i argue not quite as
easily as with xml.
> Still think we could try sending metrics out in an XDR table with a
> hashed-up value for "metric name" which corresponds to an entry in a
> previously-transmitted metric attribute lookup table... keeps the
> transmitted data simple, after all.
federico and i have talked until we're blue in the face about the whole
idea of separating metric attributes from values on the wire. i really
really really wanted that... but fed and i came to the conclusion that
it's not very simple to do using group messaging (multicast)... since
multicast is not reliable. what happens if a value is received but the
attributes are not? i don't think it's a huge hit on the network to
explicitly send both together. it's the only way to guarantee they both
arrive over unreliable channels ... and stream-oriented designs are hard
to scale to huge clusters even though they give us the reliability we need.
[insert thinking here while dealing with a screaming baby]
i'm thinking more about this.. and using ascii.. we might be able to
separate the attributes from the values. in the past .. for example.. we
needed to know what xdr data type was being sent.. a string, uint8, int32,
etc. well.. using straight xml ascii we just send/save the data as
text. we don't need to know the data type.. it is always text. the only
time the data type is important is later when we want to act on the data
(e.g. sort it)... and we really only need the type in C. PHP, Python and
Perl use implicit typing based on the string... so..
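a tiny python sketch of that point (hostnames and values made up): the
values stay text all the way through, and the type only shows up at the
moment we act on them...

```python
# metric values arrive as text; no type information on the wire.
values = {"node01": "12.5", "node02": "2", "node03": "101"}

# sorting the raw strings gives lexicographic order ("101" < "12.5" < "2")...
print(sorted(values, key=values.get))

# ...so the type is applied only at the point of use:
print(sorted(values, key=lambda h: float(values[h])))
```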
i have an idea of how we can do that! i think. let me know what you
think.
since we have a modular design now and each gmond module registers itself
for a particular portion of the metric tree.. then.. at registration time
the module also registers all metric attributes. the attributes should in
general be constant across the entire cluster. that means that only
key/value pairs need to be sent on the wire.. and all attributes are
implicit. for situations where attributes are not constant over the
cluster or the metric has not been measured.. we fall back to the explicit
message format (thanks to the extensibility of xml). i'm sure this will
work and the g3 code i have now is easily modified to incorporate that
feature.
that means that registered metrics would only need to have...
<b n="cpu"><m n="number" v="2"/><m n="system" v="12.5"/></b>
that's 60 bytes of data.. just as an exercise.. let's see what this would
take in XDR using xdr_array and xdr_string...
xdr_string for the branch "cpu" = 4+4 = 8 bytes
xdr_array count for the metric strings = 4 bytes
xdr_strings for "number" and its value "2" = 12+8 = 20 bytes
xdr_strings for "system" and its value "12.5" = 12+8 = 20 bytes
(each xdr_string is a 4-byte length word plus the data padded to a 4-byte
boundary)
for a total of 52 bytes... it's close but for this example xdr is a little
smaller. it'll be interesting to see the difference in CPU/speed
processing the two. xdr will take a hit on little-endian boxes (e.g. x86)
but not on big-endian boxes like suns.
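here's a quick python sanity check of those byte counts, packing strings
the way xdr_string does (4-byte big-endian length, then the data padded to
a 4-byte boundary) and comparing against the xml message above:

```python
import struct

def xdr_string(s):
    """Pack a string XDR-style: 4-byte length + data padded to 4 bytes."""
    data = s.encode()
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

xdr = (xdr_string("cpu")                             # branch name:  8 bytes
       + struct.pack(">I", 2)                        # array count:  4 bytes
       + xdr_string("number") + xdr_string("2")      # 12 + 8 = 20 bytes
       + xdr_string("system") + xdr_string("12.5"))  # 12 + 8 = 20 bytes

xml = '<b n="cpu"><m n="number" v="2"/><m n="system" v="12.5"/></b>'
print(len(xdr), len(xml))   # 52 60
```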
[back to separating attributes from values]
when gmond exports xml it would export the explicit attributes without
assuming the remote cluster has the same module installed/registered.
i really like this model.. it'll save network and CPU at no cost. we
just get the attributes locally when we can.
> Could it win a bake-off against a similarly tuned XDR method? In terms of
> speed, CPU and scalability?
bake off coming soon.. after i enjoy this three day weekend. :)
> Oh. Right. My idea.
>
> A metric pipelining plug-in with multicast and unicast support. The
> plug-in would have to be configured with a list of nodes that it's
> responsible for (or an entire cluster - maybe we could just use URLs?) and
> a reporting interval for each. Just like the metadaemon, in reverse.
> Every interval seconds, it transmits the appropriate chunk of metrics in
> XML to its configured destination. On receiving the metric chunk, it's
> treated just as if it had originated locally, and gets re-transmitted over
> the locally-configured multicast channel (obviously this only works if we
> *don't* break the pipelined data into individual metric chunks).
>
> This would actually increase Ganglia scalability (at the price of some
> latency over pipelined links) because it allows a finer degree of control
> over multicast traffic, and each individual node in a very large cluster
> doesn't have to deal with 50,000 small packets per second being firehosed
> at it (instead it's dealing with a few thousand larger packets closer to
> the MTU value).
>
> I can see that being a lot of fun for slow links... heck, after releasing
> the source it should only be a matter of time before people turn that into
> a notifier plug-in. :)
>
> OK, that's all for now, I think...
so what you're saying, i think, is that you'd like a method of explicitly
triggering remote hosts to report particular metric values?
as far as triggers... i was thinking we'll have explicit boundaries for
each metric... as an attribute. for example
<mu name="memory">
<metric name="free" value="1345" min="100"/>
</mu>
OR
<mu name="disk">
<mu name="/dev/hda2">
<metric name="fragmentation" value="3.2" max="10"/>
</mu>
</mu>
in the first case.. if the amount of free memory ever drops below 100 KB
then an alert is triggered... in the second.. if /dev/hda2 ever becomes
more than 10% fragmented an alert is sent.
if we put the message on the wire in XML, a perl alert daemon could be
written using XML::Parser in about 100 lines of code. :)
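the core of such a daemon is tiny.. here's the idea sketched in python
with ElementTree (attribute names as proposed above; the function and its
return shape are made up for illustration):

```python
import xml.etree.ElementTree as ET

def check_alerts(xml_text):
    """Walk the metric tree and fire an alert whenever a value falls
    outside its declared min/max bounds."""
    alerts = []
    for m in ET.fromstring(xml_text).iter("metric"):
        val = float(m.get("value"))
        if m.get("min") is not None and val < float(m.get("min")):
            alerts.append((m.get("name"), "below min"))
        if m.get("max") is not None and val > float(m.get("max")):
            alerts.append((m.get("name"), "above max"))
    return alerts

doc = '<mu name="memory"><metric name="free" value="50" min="100"/></mu>'
print(check_alerts(doc))   # [('free', 'below min')]
```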
have a great weekend...
-matt