[Ganglia-general] Fwd: [Beowulf] Performance metrics & reporting

Bernard Li Tue, 08 Apr 2008 16:37:28 -0700

I never found gmond to be resource intensive.  Sure there has been
memory leaks (that have since been fixed) but I doubt this is what
Donald is referring to.

Perhaps he is talking about gmetad.  Most users know that if they have
a lot of hosts (1000+) they are monitoring, they need to do something
about the RRDTool I/O -- the most common solution is to put the rrds
in tmpfs.  We talked about making this more transparent to users in
future releases.  After such a workaround is implemented, gmetad does
not use a lot of resource either.

So I'd like to ask the Ganglia community -- do you guys find Ganglia
to be a resource hog?

Cheers,

Bernard

---------- Forwarded message ----------
From: Donald Becker <[EMAIL PROTECTED]>
Date: Tue, Apr 8, 2008 at 2:32 PM
Subject: Re: [Beowulf] Performance metrics & reporting
To: Beowulf Mailing List <[EMAIL PROTECTED]>

 On Tue, 8 Apr 2008, Jesse Becker wrote:
 > Gerry Creager wrote:
 > > Yeah, we're using Ganglia.  It's a good start, but not complete...
 >
 > The next version of Ganglia (3.1.x) is being written to be much more easy to
 > customize, both on the backend metric collection by allowing custom modules
 > for gmond, and on the frontend with some changes to make custom
reports easier
 > to write.  I've written a small pair of routines to monitor SGE jobs, for
 > example, and it could easily be extended to watch multiple queues.

 It might be useful to consider what we did in the Scyld cluster system.

 We found that a significant number of customers (and potential customers)
 were using Ganglia, or were planning on using it.  But those that were
 intensively using it complained about its resource usage.  In some cases
 it was using 20% of CPU time.

 We have a design philosophy of running nothing on the compute nodes except
 for the application.  A pure philosophy doesn't always fit with a working
 system, so from the beginning we built in a system called BeoStat (Beowulf
 State, Status and Statistics).  To keep the "pure" appearance of our
 system we initially hid this in BeoBoot, so that it started immediately at
 boot time, underneath the rest of the system.

 How are these two related?  To implement Ganglia we just chopped out the
 underlying layers (which spend a huge amount of time generating then
 parsing XML), and generate the final XML directly from the BeoStat
 statistics already combined on the master.

 This gave us the best of both worlds: no additional load on compute nodes,
 lower network load, much higher efficiency, and easy scalability to
 thousands of nodes from BeoStat, and the ability to log and summarize
 historical data, good-looking displays and ability to monitor multiple
 clusters from Ganglia.

 It might be useful to look at the design of Beostat.  It's superficially
 similar to other systems out there, but we made decisions that are
 much different than others -- ones that most consider wrong until they
 understand their value.

 Some of them are:
  It's not extensible
  It reports values in a binary structure
  It's UDP unicast to a single master machine
  It has no liveness criteria
  The receive side stores only current values

 The first one is the most uncommon.  Beostat is not extensible.  You can't
 add in your own stat entries.  You can't have it report stats from 64
 cores.  It reports what it reports... that's it.

 Why is this important?  We want to deploy cluster systems.  Not build a
 one-off cluster.  We want the stats to be the same on every system we
 deploy.  We want every tool that uses the stats to be able to know that
 they will be available.  Once you allow and encourage a customizable
 system, every deployment will be different.  Tools won't work out of the
 box, and there is a good chance that tools will require mutually
 incompatible extensions.

 Deploying a fixed-content stat system also enforces discipline.  We
 carefully considered what we need to report, and how to report it.  In
 contrast look at Ganglia's stats.  Why did they choose the set they did?
 Pretty clearly because the underlying kernel reported those values.  What
 do they mean?  The XML DTD doesn't tell you.  You have to look at the
 source code.  What do you use them for?  They don't know, they'll figure
 it out later.

 People next question "but what if I have 8/16/64 cores?  You only have 2
 [[ now 4 ]] CPU stat slots."  The answer is similar to above -- what are
 you going to do with all of that data?  The answer is summarize it before
 using it.  We just summarize it on the reporting side.  We report that
 there are N CPUs, the overall load average, and then summarize the CPU
 cores as groups (e.g. per socket).  For network adapters we report e.g.
 eth0, eth1, eth2 and "all the rest added together".

 Once we chose a fixed set of stats, we had a ability to make it a fixed
 size report.  It could be reported as binary values, with
 any per-kernel-version variation done on the sending side.

 Having a small, limited-size report meant that it fit in a single network
 packet.  That makes the network load predictable and very scalable.
 It gave us the opportunity to effectively use UDP to report, without
 fragmenting into multiple frames.  UDP means that we can switch to and
 from multicast without changes, even changing in real time.

 A fixed-size frame makes the receiving side simple as well.  We just
 receive the incoming network frame into memory.  No parsing, no
 translation, no interpretation.  We actually do a tiny bit more, such as
 putting on a timestamp, but overall the receiving process does only
 trivial work.  This is important when the receiver is the master, which
 could end up with the heaviest workload if the system isn't carefully
 designed.  We've support 1000+ machines for years, and are now designing
 around 10K nodes.

 We actually do a tiny bit more when storing a stat packet -- we add a
 timestamp.  We can use this to figure out the time skew between the master
 and computer node, verify the network reliability/load, and to decide if
 the node is live.

 This isn't the only liveness test.  It's not even the primary liveness
 test.  We document it as only a guideline.  Developers should use the
 underlying cluster management system to decide if a node has died.  But if
 there hasn't been a recent report, a scheduler should avoid using the node.
 Classifying the world into Live and Dead is wrong.  It's at least Live,
 Dead and Schrodinger's Still-boxed Cat

 Finally, this is a State, Status and Statistics system.  It's a
 scoreboard, not a history book.  We keep only two values, the last two
 received.  That gives us the current info, and the ability to calculate
 rate.  If any subsystem needs older values (very few do) it can pick a
 logging, summarization and coalescing approach of its own.

 We made many other innovative architectural decisions when designing the
 system, such as publishing the stats as a read-only shared memory version.
 But this are less interesting because no one disagrees with them ;-).

 --
 Donald Becker                           [EMAIL PROTECTED]
 Penguin Computing / Scyld Software
 www.penguincomputing.com                www.scyld.com
 Annapolis MD and San Francisco CA

 _______________________________________________
 Beowulf mailing list, [EMAIL PROTECTED]
 To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

[Ganglia-general] Fwd: [Beowulf] Performance metrics & reporting

Reply via email to