On Tuesday, September 24, 2002, at 10:49 AM, Steven Wagner wrote:
People have cited disk I/O as a bottleneck. I personally doubt this. If it were true, you'd be seeing random gaps whenever RRD updates came thick and fast (i.e. while all threads were updating RRDs at once), and the failures should be at least a bit more distributed - not just in one cluster.
So we noticed that disk I/O was a bottleneck on the old version of gmetad. We had ~400 hosts, each with ~25 RRDs (including summaries) of 12 KB each, updated once every 15 seconds. That's 10,000 file updates every 15 seconds, and since their aggregate size was about 120 MB, the working set exceeded Linux's filesystem cache.
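To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The host count, RRD count, file size, and poll interval are the numbers from our setup above; the 64 MB cache figure is only an illustrative placeholder, so plug in your own values.

    # Rough estimate of the RRD write load on gmetad's data directory.
    # Numbers are from the setup described above; the cache size is a guess.

    hosts = 400            # nodes reporting through gmetad
    rrds_per_host = 25     # per-host RRDs, including summary metrics
    rrd_size_kb = 12       # typical size of one RRD file
    poll_interval_s = 15   # how often every RRD gets rewritten
    cache_mb = 64          # assumed RAM available to the page cache

    files = hosts * rrds_per_host
    working_set_mb = files * rrd_size_kb / 1024.0
    writes_per_s = files / float(poll_interval_s)

    print("RRD files:        %d" % files)                # 10000
    print("working set:      %.0f MB" % working_set_mb)  # ~117 MB
    print("file updates/sec: %.0f" % writes_per_s)       # ~667

    if working_set_mb > cache_mb:
        print("working set exceeds assumed cache -> expect disk thrashing")

At ~667 scattered small writes per second on a working set bigger than the cache, the disk never gets a chance to keep up, which is what we were seeing.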
Just by checking the number of disk interrupts we knew disk I/O was a problem, not to mention the inconsistent-looking graphs. When we put the RRDs on a ramdisk, things got much better. All of which points to disk I/O as the bottleneck. With gmetad version 2.5.0 things are better, since it doesn't keep RRDs for constant metrics, etc., but we are still using a memory-backed filesystem (tmpfs) to store the RRDs for good measure. We have not tested the new gmetad with a normal (ext2) filesystem, so perhaps our initial problem is gone.
I really hope you aren't mixing Linux, FreeBSD and IRIX nodes *WITHIN* the same cluster. They support different metric configurations which, unfortunately, don't play well with one another, and depending on which platform gmetad queries, you may get wildly different results (not to mention funky heartbeats). Individual clusters should be as homogeneous as possible, or you should check out metric.h and make sure all the members are using only the metrics they all understand, and in that order.
This is exactly right.

Federico

Rocks Cluster Group, Camp X-Ray, SDSC, San Diego
GPG Fingerprint: 3C5E 47E7 BDF8 C14E ED92 92BB BA86 B2E6 0390 8845
