i just uploaded another snapshot of ganglia
http://matt-massie.com/g3/ganglia-3.0.0.tar.gz
i've been working on polishing the xml before i start work on the
s-expressions side of things (since the s-expressions will just be
condensed versions of the xml).
the latest snapshot has the ability to timestamp all data in the xml tree.
run ./tests/test-tree to see a sample. all timestamps are in ISO 8601
format, zulu time (UTC).
here is what i propose for the xml... i'm looking for comments .. things
i'm missing.. things you don't like... etc.
all elements of the xml will be tagged with timestamp attributes: "birth",
"age", "step" and "expires". birth is an ISO 8601 timestamp in zulu time of
when the data was created/inserted into the set. age is the number of
secs elapsed since birth. step is the maximum number of seconds between
(re)births [the time threshold], and expires is the number of seconds
after birth at which the data will cease to exist.
e.g.
... birth="2003-04-24T23:00:18Z" age="46" step="60" expires="3600" ...
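a minimal python sketch of how a reader of the xml might interpret these four attributes. it is not code from the ganglia library: the function names are hypothetical, and the rule that data is "stale" once age exceeds step is an assumption based on the definitions above.

```python
from datetime import datetime, timezone

def parse_birth(birth):
    """Parse an ISO 8601 zulu timestamp such as 2003-04-24T23:00:18Z."""
    parsed = datetime.strptime(birth, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def age_of(birth, now):
    """Whole seconds elapsed between birth and now (a UTC datetime)."""
    return int((now - parse_birth(birth)).total_seconds())

def is_stale(age, step):
    """Assumption: stale means no (re)birth arrived within step seconds."""
    return age > step

def is_expired(age, expires):
    """The data ceases to exist expires seconds after birth."""
    return age >= expires
```

with the example values above (age="46" step="60" expires="3600"), both is_stale and is_expired would return False.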
the birth date is in zulu time to handle timezone differences. the
library (as i have it written now) handles the time attributes as follows:
if a tag being captured to the data structure already has time attributes,
they are not modified but are just put into the data structure as-is. if,
however, no time attributes exist, they are created. this behavior is
especially important for gmetad, since it allows entire portions of the
organization structure (grids, clusters, etc) to be marked as out-of-date
(and even allows for their removal after a period of time). it also
prevents the freshness of a tag from being updated simply because it is
read upstream.
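the capture rule can be sketched in python as follows. this is an illustration only, not the library's implementation: the stamp() helper is hypothetical, and the default step and expires values are assumptions borrowed from the example above.

```python
import xml.etree.ElementTree as ET

TIME_ATTRS = ("birth", "age", "step", "expires")

def stamp(elem, now, step=60, expires=3600):
    """Create time attributes only when the tag carries none of them.

    now is an ISO 8601 zulu timestamp string; step and expires are
    assumed defaults, not values defined by ganglia.
    """
    if any(name in elem.attrib for name in TIME_ATTRS):
        return elem  # already timestamped upstream: leave untouched
    elem.set("birth", now)
    elem.set("age", "0")
    elem.set("step", str(step))
    elem.set("expires", str(expires))
    return elem
```

because an already-stamped tag is returned unchanged, reading it upstream never makes it look fresher than it is.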
the three organizational elements (ou, host, mu) will simply have an index
attribute "id" and these time values.
e.g.
<ou id="California" birth=".." age=".." step=".." expires="..">
  <ou id="Berkeley" birth=".." ...>
    <ou id="UCBerkeley" birth=".." ...>
      <host id="www" birth=".." ...>
        <mu id="network" birth=".." ...>
          <metric ...>
        </mu>
      </host>
    </ou>
  </ou>
</ou>
this is not much different than 2.x except for the host tag. in 2.x the
host tag had ip and location attributes. i think it makes more sense to
put those in the host data tree.
<host id="foo" birth="..." ...>
  <mu id="network" ...>
    <metric id="ip" value="1.1.1.1" ...>
  </mu>
  <mu id="gps" ...>
    <metric id="longitude" value=".." ...>
    <metric id="latitude" value=".." ...>
    <metric id="altitude" value=".." ...>
  </mu>
  <mu id="rack" ..>
    <metric id="x" value=".." ..>
    <metric id="y" value=".." ..>
    <metric id="z" value=".." ..>
  </mu>
</host>
or something like that.
i would like to make the organizational tags as generic as possible and
have the specific attributes fall into the tree under the host.
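as a rough illustration of what this means for a consumer, here is a python sketch that looks up a host's ip from the data tree instead of from a host attribute. the host_metric() helper is hypothetical, and the time attributes are left out of the sample to keep it short.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the host layout sketched above.
HOST_XML = """
<host id="foo">
  <mu id="network">
    <metric id="ip" value="1.1.1.1"/>
  </mu>
</host>
"""

def host_metric(host, mu_id, metric_id):
    """Return the value of one metric under a host, or None if absent."""
    node = host.find(f"./mu[@id='{mu_id}']/metric[@id='{metric_id}']")
    return None if node is None else node.get("value")

host = ET.fromstring(HOST_XML)
print(host_metric(host, "network", "ip"))  # prints 1.1.1.1
```

a host without a "gps" or "rack" mu simply returns None for those lookups, so the host tag itself never has to change when new kinds of data are added.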
the metric tag has the id and time attributes...
<metric id=".." birth.. etc ..
plus the following attributes
  type = (string|number|float|bignum)
    all numbers will be held as long ints, all floats will be held as
    doubles, and the bignum will be for any numbers over the maximum
    long value. the bignum libraries that i've seen are based on text
    strings instead of binary representations so they are slower. but
    i only expect them to be needed by daemons far upstream of the
    data, which means they will mostly be summarizing. basically they
    trade detailed information for the ability to manage large numbers.
  trend = (constant|gauge|counter)
    (like the old slope tag...)
  max = n
    (absolute maximum value this metric can have)
  min = n
    (absolute minimum value this metric can have)
  alert_max = n
    (send an alert if the value is greater than n)
  alert_min = n
    (send an alert if the value is less than n)
  samples = n
    (the number of samples in the calculation)
  reduce = (sum|avg) any others?
    (the function used to reduce values for summary.. really important
    for alerts on summary data but little else, since we could just
    use the value and samples. the default is sum.)
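one way the reduce and samples attributes could interact when summarizing upstream, as a python sketch. nothing here is from the ganglia code: reduce_values() is a hypothetical helper, and weighting the average by each input's sample count is an assumption about how "avg" would be computed.

```python
def reduce_values(parts, reduce="sum"):
    """Combine (value, samples) pairs into one (value, samples) summary.

    reduce defaults to "sum", matching the default proposed above.
    """
    total_samples = sum(n for _, n in parts)
    if reduce == "sum":
        return sum(v for v, _ in parts), total_samples
    if reduce == "avg":
        if total_samples == 0:
            raise ValueError("cannot average zero samples")
        # Assumption: weight each value by the samples behind it.
        weighted = sum(v * n for v, n in parts)
        return weighted / total_samples, total_samples
    raise ValueError(f"unknown reduce function: {reduce}")
```

for example, reduce_values([(2.0, 1), (4.0, 3)], "avg") gives (3.5, 4), a value that can be compared directly against alert_max or alert_min.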
so here is a sample of the xml:
<ou id="Grid A" birth="2003-04-25T17:16:53Z" age="34" step="60">
  <ou id="Cluster A" birth="2003-04-25T17:17:39Z" age="52" step="60">
    <mu id="cpu" birth="2003-04-25T17:17:39Z" age="34" step="120"
        units="%" type="float" trend="gauge" min="0" max="100"
        samples="128" reduce="avg">
      <metric id="user" value="2.0" alert_max="95.0"/>
      <metric id="nice" value="0.3"/>
      <metric id="system" value="5.3"/>
      <metric id="idle" value="92.4"/>
    </mu>
    <host id="bio10" birth="2003-04-25T17:18:45Z" age="4" step="15">
      <mu id="cpu" birth="2003-04-25T17:19:55Z" age="10" step="15"
          units="%" type="float" trend="gauge" min="0" max="100">
        <metric id="user" value="12.3" alert_max="95.0"/>
        <metric id="nice" value="10.2"/>
        <metric id="system" value="0.2"/>
        <metric id="idle" value="77.3"/>
      </mu>
    </host>
    <host id="...">
      ... many more hosts ...
    </host>
  </ou>
</ou>
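to show how generic the tree is to consume, here is a python sketch that walks a trimmed copy of the sample and builds a path for every metric from the "id" attributes alone. the walk() helper and the "/"-joined path format are hypothetical, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample: one host, two metrics.
SAMPLE = """
<ou id="Grid A" birth="2003-04-25T17:16:53Z" age="34" step="60">
  <ou id="Cluster A" birth="2003-04-25T17:17:39Z" age="52" step="60">
    <host id="bio10" birth="2003-04-25T17:18:45Z" age="4" step="15">
      <mu id="cpu" birth="2003-04-25T17:19:55Z" age="10" step="15"
          units="%" type="float" trend="gauge" min="0" max="100">
        <metric id="user" value="12.3" alert_max="95.0"/>
        <metric id="idle" value="77.3"/>
      </mu>
    </host>
  </ou>
</ou>
"""

def walk(elem, path=()):
    """Yield (path, value) for every metric at or below elem."""
    here = path + (elem.get("id"),)
    if elem.tag == "metric":
        yield "/".join(here), elem.get("value")
    for child in elem:
        yield from walk(child, here)

metrics = dict(walk(ET.fromstring(SAMPLE)))
print(metrics["Grid A/Cluster A/bio10/cpu/user"])  # prints 12.3
```

the walker never needs to know whether an element is an ou, host or mu, since every level is indexed the same way.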
make sense? comments please.
-matt