Ryan Sweet wrote:
On Tue, 24 Sep 2002, Federico Sacerdoti wrote:
Just by checking the number of disk interrupts we knew disk I/O was a
problem, not to mention the inconsistent-looking graphs. When we put the
At the moment disk I/O isn't the problem, though I see how it could be
once the rest of the systems get added. What method are you using for
backing up the RRDs from tmpfs? rsync?
Uhhhhh yeah, rsync, that's the ticket...
[Just a cron script that runs every 15 minutes - losing 15 minutes' worth
of data isn't going to freak anybody out over here, since we aren't using
this for accounting purposes, just general health. After we reach full
deployment I'll try tweaking the frequency of updates... ]
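In case it's useful to anyone, the whole thing boils down to roughly the
following (the paths are placeholders rather than our actual layout, and the
real script may differ in the details):

    # crontab entry: sync the in-memory RRDs out to disk every 15 minutes
    */15 * * * * rsync -a --delete /mnt/rrd-tmpfs/ /var/lib/ganglia/rrd-backup/

    # at boot, restore the last snapshot into tmpfs before gmetad starts
    rsync -a /var/lib/ganglia/rrd-backup/ /mnt/rrd-tmpfs/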
I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN*
the same cluster.
That's _precisely_ what I'm doing.
You know, it's a good thing we aren't developing open-source proton packs...
The compute clusters are Linux nodes with FreeBSD gateways. The network
where I'm having trouble is the engineers' workstation network, which is a
grab bag of 32-bit and 64-bit IRIX/Linux/*BSD machines (one of the things I
want to help with ASAP is getting OpenBSD to build).
OpenBSD doesn't build, eh? How odd... I guess the *BSDs don't have as much
in common as I thought.
I don't quite understand why this is (or needs to be) a problem. Shouldn't
the gmonds just hash and multicast all the metrics they receive, regardless
of whether their own host is capable of storing a given metric? It seemed to
work this way, in principle, with 2.4.1. I have a set of custom metrics (see
my topusers.pl in the gmetric scripts) that are per-user, and thus by nature
not present on every machine. For the most part these work great... it's a
really good way to see usage patterns across the network and to pin resource
usage on the users responsible, in a graph that the managers can understand.
I used to use nasty, hackish Perl scripts to create graphs from sar reports,
which were never as accurate anyway. I much prefer ganglia in this regard.
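For anyone who hasn't looked at those scripts: they mostly just shell out to
gmetric to push arbitrary name/value pairs onto the multicast channel,
something like the line below (the metric name and value are made up, and
the exact flags may vary between ganglia versions):

    # publish one user's CPU share as a custom metric; the gmonds listening
    # on the multicast channel pick it up like any other metric
    gmetric --name="cpu_user_jdoe" --value="42.0" --type="float" --units="%"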
Custom user metrics are actually given a metric hash value of 0 and are
stored in a different manner. And don't blame me, it was like this when I
got here. I refer you to the SF archives of this list for my numerous
whiny e-mails about metric handling.
And Ganglia has worked that way since at least 2.3.x, which is about when I
started taking an interest in it. It doesn't "need" to be that way; that's
just the way it is... for now.
Man, I wish I could use something that simple to track jobs here.
Unfortunately, to quote the great Keanu Reeves, "It's not that simple.
It's complicated."