On Tuesday, September 24, 2002, at 10:49 AM, Steven Wagner wrote:
People have cited disk I/O as a bottleneck. I personally doubt this. If it were true, you'd be seeing random gaps whenever RRD updates came thick and fast (i.e. while all threads were updating RRDs at once), and the failures should be at least a bit more distributed - not just in one cluster.
So we noticed that disk I/O was a bottleneck on the old version of gmetad. We had ~400 hosts, each with ~25 RRDs (including summaries) of 12 KB each, updated once every 15 seconds. That's 10,000 file updates every 15 seconds, and since their aggregate size was about 120 MB, the working set exceeded Linux's filesystem cache.
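To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The host count, RRD count, file size, and poll interval are the numbers from our setup above; the 64 MB cache figure is only an illustrative placeholder, so plug in your own values.

    # Rough estimate of the RRD write load on gmetad's data directory.
    # Numbers are from the setup described above; the cache size is a guess.

    hosts = 400            # nodes reporting through gmetad
    rrds_per_host = 25     # per-host RRDs, including summary metrics
    rrd_size_kb = 12       # typical size of one RRD file
    poll_interval_s = 15   # how often every RRD gets rewritten
    cache_mb = 64          # assumed RAM available to the page cache

    files = hosts * rrds_per_host
    working_set_mb = files * rrd_size_kb / 1024.0
    writes_per_s = files / float(poll_interval_s)

    print("RRD files:        %d" % files)                # 10000
    print("working set:      %.0f MB" % working_set_mb)  # ~117 MB
    print("file updates/sec: %.0f" % writes_per_s)       # ~667

    if working_set_mb > cache_mb:
        print("working set exceeds assumed cache -> expect disk thrashing")

At ~667 scattered small writes per second on a working set bigger than the cache, the disk never gets a chance to keep up, which is what we were seeing.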
Just by checking the number of disk interrupts we knew disk I/O was a problem, not to mention the inconsistent-looking graphs. When we put the RRDs on a ramdisk, things got much better. All of which points to disk I/O as the bottleneck. With gmetad version 2.5.0 things are better, since it doesn't keep RRDs for constant metrics, etc., but we are still using a memory-backed filesystem (tmpfs) to store the RRDs for good measure. We have not tested the new gmetad with a normal (ext2) filesystem, so perhaps our initial problem is gone.
I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN* the same cluster. They support different metric configurations which, unfortunately, don't play well with one another, and depending on which platform gmetad queries, you may get wildly different results (not to mention funky heartbeats). Individual clusters should be as homogeneous as possible, or you should check out metric.h and make sure all the members are using only the metrics they all understand, and in that order.
This is exactly right.

Federico

Rocks Cluster Group, Camp X-Ray, SDSC, San Diego
GPG Fingerprint: 3C5E 47E7 BDF8 C14E ED92 92BB BA86 B2E6 0390 8845
